CHAPTER 1
INTRODUCTION 

Problems of data aggregation have important implications for
educational research utilizing data from groups of individuals. This
investigation considers the consequences of "change in the units of
analysis" where relations among individuals are to be inferred from
grouped data. In Chapter 1 five research problems are discussed where
an investigator might attempt to translate properties and relations
from one level of grouping to another. A general strategy is described
for analyzing the conditions under which grouped data can be used for
inferences about individuals.

I. Terminology

Data aggregation denotes the replacement of a set of numbers by a
smaller set of numbers or "aggregates". This term occurs repeatedly in
the literature of economics and econometrics; macroeconomic theory is
based mainly on aggregated measurements.

Whenever distinct measures are combined, aggregation is involved.
In a study of foods, products such as bananas and oranges can be
combined into the category "fruits". A single aggregate index such as
per capita consumption of all fruits can replace separate consumption
indices for bananas and oranges. In an educational context, the mean
aptitude score attributed to a school is an aggregate of the scores of
its students.

Measurements can be aggregated within a person as well as between 
persons. In observational studies, the observation period is usually
divided into time intervals. Behavior during a given time interval is
represented by either the total or average number of occurrences of the
behavior in that period. Such a score is an aggregate of instantaneous
observations.

Here, grouping of observations or, simply, grouping, will refer to
the aggregation of measurements over individuals (as distinct from
aggregation over time periods or commodities). More specifically,
grouping is the replacement of numbers representing observations on
individuals with a smaller set of numbers representing observations
aggregated over a group of the individuals. An example is the formation
of school means from the achievement of students. The terms aggregation,
grouping, and grouping of observations will henceforth be used
interchangeably.

II. Inferences Involving Change in the Units of Analysis

Grouped data are common in the social sciences. Sociologists focus
on relations among collections of individuals. Educational researchers
often use the classroom or school as the sampling unit and analyze
between-class and between-school relations. The study of grouped data
introduces no special obstacles when inference is restricted to the
level at which the data are analyzed. If a study concerns the relation
between the academic and social psychological climates of the school,
the school-aggregated achievement and attitude indices are the data to
relate.

On the other hand, educational and psychological researchers are
usually concerned with relations among measurements on individuals.
The investigator may wish to determine the relation between student
aptitude and student achievement or between parents' education and
student aspirations. These measurements cannot always be examined at
the individual level, possibly because those data are not obtainable or
identifiable for each person, or perhaps because of high cost of analysis.

Facing such problems, some investigators turn to data on groups to
estimate regression and correlation coefficients at the individual level.
Their conclusions extend the results of the analyses at the group level
to the relations among individuals.

However, complications arise in translating properties and relations
from one level to another (Riley, 1964; Robinson, 1950; Scheuch, 1966;
Theil, 1954; Thorndike, 1934). These complications are discussed under
the general label "change in the units of analysis". Where this problem
arises, the investigator wishes to apply relations observed among units
at one level of aggregation to units at a different level (Blalock,
1964). The direction of inference can go toward larger aggregates, such
as states or nations, or toward smaller ones, the smallest being the
single person.

Our concern is with research where relations at the individual
level are of interest, but data are aggregated over individuals according
to some specifiable grouping rule.¹ The criterion for grouping can be
almost any characteristic of the individuals. Grouping can even be
random. The choice of grouping procedure is dictated by the data on
hand and the usefulness of a specific procedure for estimating
individual-level relations from these data.

¹ Terms such as "grouping procedure", "grouping method", "grouping rule",
"grouping technique", and "grouping strategy" will be used
interchangeably in referring to the formation of groups. "Grouping
characteristic" and "grouping criterion" will refer to the
characteristic(s) from which the group classifications are determined. The
actual classification scheme which assigns observations to groups will
be called the "grouping variable".

III. Research Problems Involving Change in Units

We next describe five research problems in which grouped
observations are used in estimating relations among measurements on
individuals. Missing observations, fallibly measured variables,
economy of analysis, anonymously collected information, and ecological
inference all create problems that data aggregation can alleviate to
some degree.

The degree of investigator control over the aggregation of data
is an important consideration. In certain contexts group membership
is determined in some natural way, e.g., school attended or census
tract. It is thus beyond the investigator's control except for the
exclusion of sampling units and individuals (limited or no investigator
control). In other contexts the investigator can manipulate the
formation of groups (completely or partially). There are generally
more options for improving estimation in the latter contexts. In
Table 1.1 we indicate the degree of investigator control over the
formation of groups for each problem. Why the methods of data
aggregation are used, how such methods are applied, and where they are
principally applied are also discussed.

A. Missing Observations

Missing data are to be expected whenever an investigator collects
a large amount of information or uses a large number of subjects.
Missing observations are particularly likely in longitudinal studies.
Thus, if student achievement is assessed on three occasions, a
particular student may miss one or more testing periods. The
investigator must then decide how to treat this hiatus in estimating
the relations among the tests, or the relations of the tests to other
variables (school characteristics, teacher characteristics, home
environment, etc.).

Table 1.1. Research problems involving data aggregation.

I. Complete Investigator Control: Group membership can be defined by
any characteristic in the data set which is measured for all
individuals.

(A) MISSING OBSERVATIONS
    Reasons for data aggregation: Missing observations on primary
        variables for some individuals inhibit confidence in
        analytical results.
    Description of application: Each missing observation on a primary
        variable is replaced by the mean response on that variable in
        some group to which the individual belongs.
    Principal application: Longitudinal and cross-sectional analysis
        of survey data.

(B) FALLIBLY MEASURED VARIABLES
    Reasons for data aggregation: Random errors of measurement
        associated with independent variables attenuate regression
        coefficients.
    Description of application: Different approaches have been
        suggested as part of the general refinement of statistical
        procedures for handling "errors-in-variables" problems.
    Principal application: Statistical treatment of measurement error.

(C) ECONOMY OF ANALYSIS
    Reasons for data aggregation: Budgetary constraints make analysis
        of massive data bases at the individual level impractical.
    Description of application: Data are collapsed into a smaller
        number of units by some grouping rule.
    Principal application: Analysis of census data and national,
        regional, and state school statistics.

II. Partial Investigator Control: Group membership can be defined by
any characteristic which has been measured simultaneously with each
primary variable.

(D) ANONYMOUSLY COLLECTED INFORMATION
    Reasons for data aggregation: Data on certain primary variables
        are collected anonymously, making it impossible to match
        observations on primary variables at the individual level.
    Description of application: Characteristics measured
        simultaneously with the anonymously collected information can
        be used to aggregate the data.
    Principal application: Confidential student records and responses
        to attitudinal questionnaires.

III. Limited or No Investigator Control: Group membership is determined
prior to the collection and analysis of data; group membership is
directly pertinent to the study of primary variables.

(E) ECOLOGICAL INFERENCE
    Reasons for data aggregation: The sampling units of the
        investigation constitute "natural" aggregates of individuals.
    Description of application: Disaggregation efforts are generally a
        necessary precondition to reasonable inferences at the
        individual level.
    Principal application: Analysis of school and classroom means
        where the school and the class are the sampling units; data
        organized by census tract or demographic region.

Many investigators simply drop from the data set any individual who
lacks information on any study variable. However, Elashoff and Elashoff
(1971, p. 1) find that "techniques such as case deletion which assume
that observations are missing at random may be extremely misleading. If
the probability model governing the occurrence of missing data is complex
the only adequate solution may be to find out what the missing
observations are".

Some investigators use the mean of the overall sample or the mean
of some subgroup to which the individual belongs as an estimate of the
missing observations. This "replace-with-the-mean" strategy is somewhat
akin to the adjustments made in factorial analysis of variance (ANOVA)
experiments where a missing observation ($X_{ijk}$) is estimated by the
mean of the ijth cell ($\bar{X}_{ij\cdot}$).

The replace-with-the-mean strategy is a use of aggregated data.
For example, Kline, Kent, and Davis (1971), investigating the political
instability of nations, replaced missing observations on stability and
literacy with means. These means were estimated from subgroups of
nations grouped by variables measured on all nations (date of
independence, location, political modernization). So each nation with
missing data on stability is assigned the mean stability score estimated
from its subgroup on, say, date of independence.
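
The replace-with-the-mean strategy can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function name, the field names ("era", "stability"), and the five
nations are hypothetical and are not taken from Kline, Kent, and Davis
(1971). Each missing value is replaced by the mean of the observed
values in the subgroup to which the unit belongs.

```python
from collections import defaultdict


def impute_with_group_means(records, group_key, value_key):
    """Return copies of the records with missing values (None) replaced
    by the mean of the observed values in the same group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if rec[value_key] is not None:
            sums[rec[group_key]] += rec[value_key]
            counts[rec[group_key]] += 1

    filled = []
    for rec in records:
        new = dict(rec)  # copy, so the original data stay untouched
        group = new[group_key]
        # A group with no observed values cannot supply a mean; leave None.
        if new[value_key] is None and counts[group] > 0:
            new[value_key] = sums[group] / counts[group]
        filled.append(new)
    return filled


# Hypothetical data: nations grouped by date of independence.
nations = [
    {"name": "A", "era": "pre-1900", "stability": 4.0},
    {"name": "B", "era": "pre-1900", "stability": 6.0},
    {"name": "C", "era": "pre-1900", "stability": None},
    {"name": "D", "era": "post-1945", "stability": 2.0},
    {"name": "E", "era": "post-1945", "stability": None},
]

filled = impute_with_group_means(nations, "era", "stability")
for rec in filled:
    print(rec["name"], rec["era"], rec["stability"])
```

Every imputed unit sits exactly at its group mean, so the imputed
values add nothing to the within-group variation. This is one reason
the properties of the grouping characteristic matter for the estimates
that follow.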

The utility of replacing missing observations with group means
depends on the variables under study. The estimates generated are
functions of the properties of the grouping characteristics: their
internal distributional properties and their relations with the study
variables. A good estimate of the actual relations can be obtained
because information relevant to the problem has been retained and
information loss thereby reduced. On the other hand, as with case
deletion, certain questions remain. In fact, the treatment of the missing
data can be complicated as well as simplified by this particular grouping
strategy.

B. Fallibly Measured Variables

It is well known that estimates of regression coefficients are
attenuated by random errors in the independent variables. Let
$\beta_{YX}$ be the regression coefficient where X is the observed
independent variable, and let $\rho_{XX'}$ represent the reliability of
the measurement of $T_X$, the person's true score on the independent
variable. The usual procedure is to use $\beta_{YX}/\rho_{XX'}$ rather
than $\beta_{YX}$ to estimate $\beta_{YT_X}$.

Madansky (1959) reviewed in detail the literature on the fitting of
straight lines when both variables are subject to error. He discussed
several grouping techniques that were proposed to handle problems arising
from an imperfectly measured independent variable in regression analysis.
Methods developed by Wald (1940) and Bartlett (1942) are perhaps the
most familiar.

Recently, Blalock (1964; 1970) has reconsidered the Wald-Bartlett
techniques and has advanced his own plan for using grouping with
imperfectly measured variables. He recommends that the investigator group
on an "instrument", a variable which (1) affects the "true" independent
variable, and (2) does not directly affect the dependent variable. The
relationships of interest are then estimated from the grouped
observations.

Both the Wald-Bartlett and Blalock grouping techniques are based
on the principle that measurement errors tend to cancel out within
groups if the grouping characteristic is highly related to the "true"
values of the independent variable but is uncorrelated with the
measurement errors. Under these conditions, the error portion of the
observed variance in the independent variable decreases when group
means are used, especially as the size of the groups becomes large.
Thus the reliable portion of the variation increases through grouping,
and the regression estimates are in part disattenuated.
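
A small simulation may make the cancellation principle concrete. The
sketch below is not the Wald, Bartlett, or Blalock procedure itself; it
is a simplified illustration in which all names and numerical settings
(the true slope of 2.0, unit error variances, 4000 cases, 20 groups)
are assumptions chosen for the example. Observations are grouped on a
simulated instrument that affects the true independent variable but is
unrelated to the measurement errors and has no direct effect on Y.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)      # fixed seed so the sketch is repeatable
TRUE_SLOPE = 2.0            # assumed effect of the true score on Y
N, GROUPS = 4000, 20        # hypothetical numbers of cases and groups

rows = []
for _ in range(N):
    z = rng.gauss(0, 1)                    # instrument: affects T, not Y directly
    t = z + rng.gauss(0, 1)                # "true" independent variable
    x = t + rng.gauss(0, 1)                # observed X = true score + random error
    y = TRUE_SLOPE * t + rng.gauss(0, 1)   # dependent variable
    rows.append((z, x, y))

# Individual-level regression on the fallible X is attenuated.
individual = slope([r[1] for r in rows], [r[2] for r in rows])

# Group on the instrument: sort by z and cut into equal-sized groups.
rows.sort(key=lambda r: r[0])
size = N // GROUPS
x_means, y_means = [], []
for g in range(GROUPS):
    chunk = rows[g * size:(g + 1) * size]
    x_means.append(sum(r[1] for r in chunk) / size)
    y_means.append(sum(r[2] for r in chunk) / size)

# Regression on the group means: errors largely cancel within groups.
grouped = slope(x_means, y_means)

print(f"individual-level slope: {individual:.3f}")
print(f"slope from group means: {grouped:.3f}")
```

Under these assumptions the reliability of X is 2/3, so the
individual-level slope should fall near two-thirds of the true slope,
while the slope computed from group means should lie much closer to
the true value.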

The merits of the approaches suggested by Wald, Bartlett, Blalock,
and others will not be debated here. However, their work suggests that
the grouping of observations may be one way to resolve certain
measurement difficulties.

C. The Economy of Analysis

Grouping may be prescribed when there is an overabundance of
relevant data, and the budget for analysis is limited. For example,
costs may prevent use of the complete data from the California State
Testing Program in relating minority status to achievement. The
analyst may choose to sample districts, or to carry out a
between-districts analysis. The latter analysis involves a change of units
if the investigator then makes interpretations at the individual level.

Econometricians have already developed sound principles for
grouping where economy of analysis is the chief concern (Prais and
Aitchinson, 1954; Cramer, 1964). The resulting loss of efficiency has
been only a few percent in the cases economists typically examine.

The successful use of aggregation in this context can be largely
attributed to the investigator's ability to choose the aggregating
variable. In most cases where economy of analysis is the concern, the
investigator can choose which characteristic(s) will define the groups
of students, be it sex, classroom, school, or some other measure. The
investigator can secure meaningful estimates from aggregated data by
choosing a grouping characteristic whose relations with the study
variables best satisfy the conditions of efficient grouping.

D. Anonymously Collected Information 

It is usually impossible to match data collected anonymously with
identified information on other variables on the same persons. For
example, student achievement cannot be compared with attitude when
responses to the attitude questionnaires are anonymous. If, however,
some auxiliary information about individuals can be collected along
with the attitude questionnaire, partial identification by group
membership can sometimes lead to accurate and efficient estimation of
the relations between attitudes and achievement (Boruch, 1971; Feige
and Watts, 1970; 1972). These estimates may be obtained from grouping
procedures similar to those used in contexts where the investigator has
complete control over the choice of grouping procedure. For example,
students can be grouped by county of residence; then the between-county
regression of student attitude on student achievement can be used as an
estimate of the individual-level regression. Or the student could be
asked to indicate the second letter of his last name. What auxiliary
information is suitable depends on the study conditions, but a "good"
grouping technique has certain general properties. Once these
properties are known, the investigator can build "good" grouping
characteristics into his study design.
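
The county example can be sketched as follows. The two data files are
never linked person by person; only the county label, recorded with
both, connects them. The records, the function names, and the use of an
unweighted regression on county means are all assumptions made for
illustration; a fuller treatment might weight counties by size.

```python
from collections import defaultdict


def group_means(pairs):
    """Map each group label to the mean of the values recorded under it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for label, value in pairs:
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical identified file: (county, achievement score).
achievement = [("a", 8.0), ("a", 12.0), ("b", 20.0), ("c", 28.0), ("c", 32.0)]
# Hypothetical anonymous file: (county, attitude score). No student identifiers.
attitude = [("a", 1.0), ("b", 1.5), ("b", 2.5), ("c", 3.0)]

ach_means = group_means(achievement)
att_means = group_means(attitude)

# Only counties present in both files can contribute a pair of means.
counties = sorted(set(ach_means) & set(att_means))
between_county_slope = slope(
    [ach_means[c] for c in counties],
    [att_means[c] for c in counties],
)
print(f"between-county slope of attitude on achievement: {between_county_slope:.3f}")
```

Whether this between-county slope is a reasonable estimate of the
individual-level regression depends on how the grouping characteristic
relates to the study variables.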

E. Ecological Inference: Aggregate Sampling Units

It is not uncommon to sample aggregates of individuals rather than
the individuals themselves. For example, every student in a classroom
can be studied rather than students selected individually from the
student body. Scores can be obtained from student bodies of schools and
colleges, and between-school and between-college relations analyzed.
City, county, and census tract means can be the sampling units in
sociological and economic studies.

Inferences drawn from aggregate sampling units can lead to what has
been called the "ecological fallacy" (Robinson, 1950). The "ecological
fallacy" is the practice of inferring relations between properties of
individuals from the relations of group data (Alker, 1969; Selvin, 1958).
Although sociologists and political scientists beginning with Robinson
have discussed "ecological inference", the writers in the educational
and psychological literature have often overlooked the issue.²

When sampling units are groups of individuals, between-group
analysis is logical even when the relations among measurements on
individuals are the primary concern. The investigator lacks control over
group membership and thus cannot select a suitable grouping procedure as
in other contexts. In many instances, he is unable to determine how the
required grouping procedure affects the variation and covariation of the
study variables. Under these conditions, the possibility of inferring
relations at the individual level is limited.

In any case, the sampling of groups can present a particularly
complex type of aggregation problem, since questions regarding sampling
bias arise in addition to concerns about level of inference. One
question may be whether the sampled classrooms (counties) are
representative of the classrooms (counties) in the universe to which one
wants to generalize. The investigator must clearly understand the basis
for his inferences to the individual level in order to be at all confident.
Otherwise, it may be best to make inferences at the group level or to
examine the individuals within groups, or to do both.

² Oddly enough, one of the first references to the inflationary effects
of estimating correlation coefficients from grouped data was by the
eminent psychologist E. L. Thorndike (1934). There appear to be no
further comments on the topic from educational and psychological
researchers except the papers questioning the appropriateness of
estimating individual learning curves from group learning curves
(e.g., Estes, 1956).

F. Applicability of Grouping Scheme in Different Contexts

This investigation offers a general scheme for identifying the 
consequences of grouping. This scheme will enable an investigator to 
choose the best grouping characteristics from a larger set when informa- 
tion about interrelations of each grouping characteristic and the study 
variables is known. Thus, our results are most applicable to problems
(A) through (D) where the investigator has at least partial control over 
the aggregation procedure. The ordered grouping characteristics that 
can occur in these contexts are also easier to handle since the deter- 
mination of the relations of ordered characteristics to the study 
variables is straightforward. 

The extra difficulties of grouping when some data are collected
anonymously [problem (D)] largely arise from the inability to group on
certain primary variables. It is best to group on the independent
variable in a regression analysis, but this is not possible when
observations on independent and dependent variables cannot be linked.
The general scheme will offer suitable alternative procedures in this
context that approximate the optimal grouping method.

The problems of ecological inference [problem (E)] are the most
complex because there is no choice of grouping procedure and also
because the observable grouping characteristic usually has a nominal
scale. Our scheme offers little direct guidance on how to proceed in
this context, though it will usually indicate when inferences about
individual relations are out of the question. However, the conditions
necessary to determine when such inferences are reasonable are unlikely
to occur unless the analysis can be carried out at the individual level.
If individual-level analyses are possible, ecological inference is
usually unnecessary.

The analytical arguments will be restricted mainly to the
conditions prevailing when the investigator can choose among several
ordered grouping characteristics [problems (A) through (D)]. Our
hypothetical examples and empirical analyses will refer mainly to
problems of economy of analysis [problem (C)] and of anonymously
collected information [problem (D)]. Application in other contexts
will be indicated where appropriate.

IV. Problems to be Considered 

This inquiry focuses on how grouped data can be used for inferences
about individuals, particularly in educational research. The problems
discussed in the previous section affirm the need for a clear
understanding of this technique. We cannot specify the problems exactly
until the technical terminology and notation are developed, but we can
identify previously unsettled issues to be considered.

Regression and correlation coefficients calculated from grouped
data may be biased estimates of the corresponding individual-level
relations. Robinson (1950) showed that the bias in such correlation
coefficients is strongly determined by the ratios of the between-group
variation of the variables to their total variation. Other researchers
(Blalock, 1964; Feige and Watts, 1972; Hannan, 1970; 1971) have shown
empirically that the bias in a regression coefficient depends on the
relation of the grouping characteristic to the independent and dependent
variables.
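
The role of these variance ratios can be checked numerically. The
sketch below assumes the decomposition usually attributed to Robinson
(1950), which this chapter does not write out: the total correlation
equals r_within * sqrt((1 - eta_x^2)(1 - eta_y^2)) + r_between * eta_x
* eta_y, where eta_x^2 and eta_y^2 are the ratios of between-group to
total variation. The data and the function names are hypothetical.

```python
from math import sqrt


def sums_about(points, cx, cy):
    """Sums of squares and cross-products of points about (cx, cy)."""
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    return sxx, syy, sxy


def corr(sums):
    """Correlation from a (Sxx, Syy, Sxy) triple."""
    sxx, syy, sxy = sums
    return sxy / sqrt(sxx * syy)


# Hypothetical grouped data: three groups of (X, Y) observations.
groups = [
    [(1, 5), (2, 3), (3, 4)],
    [(4, 9), (5, 6), (6, 8), (7, 7)],
    [(8, 12), (9, 10), (10, 11)],
]

points = [p for g in groups for p in g]
n = len(points)
grand_x = sum(x for x, _ in points) / n
grand_y = sum(y for _, y in points) / n
total = sums_about(points, grand_x, grand_y)

within = [0.0, 0.0, 0.0]
between = [0.0, 0.0, 0.0]
for g in groups:
    mx = sum(x for x, _ in g) / len(g)
    my = sum(y for _, y in g) / len(g)
    for i, part in enumerate(sums_about(g, mx, my)):
        within[i] += part                       # pooled within-group sums
    between[0] += len(g) * (mx - grand_x) ** 2  # size-weighted between sums
    between[1] += len(g) * (my - grand_y) ** 2
    between[2] += len(g) * (mx - grand_x) * (my - grand_y)

r_total, r_within, r_between = corr(total), corr(within), corr(between)
eta_x = sqrt(between[0] / total[0])   # correlation ratio for X
eta_y = sqrt(between[1] / total[1])   # correlation ratio for Y

recomposed = (r_within * sqrt((1 - eta_x ** 2) * (1 - eta_y ** 2))
              + r_between * eta_x * eta_y)

print(f"within-group r  : {r_within:.3f}")
print(f"between-group r : {r_between:.3f}")
print(f"total r         : {r_total:.3f}")
print(f"recomposed r    : {recomposed:.3f}")
```

In this constructed example the within-group correlation is negative
while the between-group and total correlations are positive, which
shows how the group-level coefficient can differ from the individual
one even in sign.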

We propose to trace rationally how aggregation bias depends on the
interrelations among the variables of interest. Our structure, which
includes cases hitherto neglected, will be a taxonomy that contains the
possible interrelations between the grouping variables and the other
variables. In addition to presenting logical and empirical arguments,
as in previous studies, we shall develop mathematical formalization for
the effects due to the choice of grouping variable. While emphasizing
bias, we shall also discuss efficiency and precision of regression
coefficients. Bias in correlation coefficients will be considered only
incidentally although a way of estimating zero-order correlations from
grouped data is also described.

Aggregation bias will be studied in both bivariate and multivariate 
relationships. The effects of varying the number of groups and the 
number of observations per group will also be considered. The latter 
work will indicate which properties of grouping are most sensitive to 
sample size. The intent is to formulate strategies for minimizing 
information loss when grouped data are used for individual inference.

V. Overview of Later Chapters

Earlier literature on estimating correlation and regression
coefficients from grouped observations is reviewed in Chapter 2. Most
of the work cited is drawn from sociology and economics.

In Chapter 3 we state formally what is known about estimating the
simple linear regression coefficient from grouped observations and
extend previous work. Alternative models are discussed. After
extending the "structural equations" approach (Blalock, 1964; Hannan,
1970, 1971; 1972) by incorporating a function of the grouping
characteristic as a variable in the system, we present a taxonomy of
the relations between the "grouping variable" and the other study
variables. Formulas are then derived for the bias and efficiency of
variables from each taxonomic category. Finally, we discuss the
implications of the results for investigators using grouped data.

Other aspects of the single-regressor case are considered in
Chapter 4 with emphasis on within-variable factors such as the number
of groups and the number of cases per group. We also describe an
alternative scheme for characterizing the grouping process which
complements the treatment in Chapter 3. The chapter closes with a
discussion of ways to examine the effects of grouping on a nominal
characteristic.

In Chapter 5 we consider the case of two regressors and point
toward extension to any number of additional independent variables.
The literature specific to the multivariate case is reviewed, and the
taxonomic approach is applied to the two-regressor model.

An empirical demonstration of effects in the single-regressor
case is presented in Chapter 6. Information collected from incoming
freshmen at a large Midwestern university serves as the data base.
First, for a certain X-Y pair, the regression slope and its sample
variance are estimated from the ungrouped observations under a simple
linear model. Then one or another student characteristic is used to
group observations, and the Y-on-X regression slope and its sample
variance are estimated from the grouped observations. The empirical
results are shown to conform to the conclusions derived analytically.
The use of composites of estimates from different grouping procedures
is described; this improves estimation in certain contexts.

CHAPTER 2
REVIEW OF THE LITERATURE ON GROUPING OF OBSERVATIONS

Historically, investigations of the effects of grouping on the
estimation of individual-level relations have followed two distinct
lines of inquiry. On the one hand, statisticians and behavioral
scientists (mostly sociologists) have considered this question while
studying the "ecological fallacy" (Robinson, 1950), the effects of
modifiable units (Yule and Kendall, 1950), and the problems caused by
a "change in the units of analysis" (Blalock, 1964). These
investigations share an interest in the circumstances under which the
analysis of grouped units inflates estimates of individual-level relations.

Economists, on the other hand, have traditionally treated grouping
as a legitimate strategy for reducing the cost of analysis. Their
mathematical formulations have indicated that grouping simply reduces
the efficiency of regression estimates without introducing any bias.
Thus, they have hunted for the most efficient means of forming groups.
Prais and Aitchinson (1954) and Cramer (1964) represent this econometric
tradition.

In recent years, the distinctions between the approaches have
blurred as the methodologies of the behavioral sciences and econometrics
converged. Hannan (1970, 1971; 1972) and Feige and Watts (1972) are
largely responsible for this convergence.¹

Below, we review only the key presentations from the two lines of
inquiry. Summaries of previous work in these areas have already appeared
elsewhere.² We reserve the detailed discussions of certain key
investigations for a later chapter. In Chapter 3 we examine work by
Prais and Aitchinson (1954), Cramer (1964), Blalock (1964), and Hannan
(1971; 1972) on the effects of grouping on the estimation of simple
linear regression coefficients. In Chapter 6 our study of the
multivariate case is juxtaposed with reviews of work by Prais and
Aitchinson (1954), Haitovsky (1966), and Feige and Watts (1972).

¹ See also Burstein (1974) and Hannan and Burstein (1974).

I. Behavioral Scientists' Perspectives on Grouping

The earliest articles on the effects of grouping indicated that
correlation coefficients increase when the size of units (e.g., census
tracts) is increased. In 1934, Gehlke and Biehl showed how the
correlation of total number of male juvenile delinquents with median
monthly rental in Cleveland, Ohio changed from -.502 as the city's 252
census tracts were successively grouped into larger regions. The
magnitude of the correlations increased steadily with the degree of
aggregation:

NUMBER OF REGIONS    252    200    175    150    125    100     50     25
CORRELATION        -.502  -.569  -.580  -.606  -.662  -.667  -.685  -.763



² Selvin (1958), Scheuch (1966), Alker (1969), Allardt (1969), Cartwright
(1969), Shively (1969), and Iversen (1973), among others, reviewed the
grouping literature in the behavioral sciences, focusing on Robinson's
(1950) work and related papers, but offered little significant new
material. Among the above, only Selvin and Scheuch refer to related
studies by Gehlke and Biehl (1934), Thorndike (1939), and Yule and
Kendall (1950). Johnston (1971) reviews the econometric studies.
Hannan's work (1970, 1971; 1972) combines a review of previous work
with contributions to the theory.

Thorndike (1939) demonstrated the problems associated with the use
of grouped data in the course of his investigation of the determinants
of intelligence. He pointed out that the correlation between two traits
(X and Y) in m groups equals the correlation between the traits for
the individuals composing the groups only under very special
circumstances. He added that the latter correlation was usually much smaller.

Thorndike then constructed an illustration with intelligence
quotient as X, the number of rooms per person as Y, and the
twelve districts of a city as units for aggregation. Within each
district he created a sample of X and Y values such that within
districts r = 0. When observations at the individual level were
subsequently pooled over districts, r = .45; but the between-districts
correlation of X and Y averages was .90.
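
The pattern in this illustration can be reproduced with constructed
data. The sketch below does not use Thorndike's values; the twelve
districts, the four observations per district, and the SPREAD constant
are hypothetical choices that give a within-district correlation of
zero, a moderate pooled correlation, and a between-district
correlation near unity.

```python
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


SPREAD = 4.0  # hypothetical within-district scatter around the district mean

# Twelve districts; district k is centered at (k, k). The four points are
# placed symmetrically so that X and Y are uncorrelated within a district.
districts = [
    [(k + dx, k + dy) for dx in (-SPREAD, SPREAD) for dy in (-SPREAD, SPREAD)]
    for k in range(12)
]

within = [pearson([x for x, _ in d], [y for _, y in d]) for d in districts]

pooled = pearson(
    [x for d in districts for x, _ in d],
    [y for d in districts for _, y in d],
)

between = pearson(
    [sum(x for x, _ in d) / len(d) for d in districts],
    [sum(y for _, y in d) / len(d) for d in districts],
)

print("largest |r| within a district:", max(abs(r) for r in within))
print(f"pooled individual-level r    : {pooled:.3f}")
print(f"between-district r of means  : {between:.3f}")
```

All of the association in the pooled data comes from differences among
district means, so correlating the means alone overstates the relation
among individuals.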

More than ten years passed before questions regarding inferences
from grouped data reappeared. Yule and Kendall (1950) stated that if the
units of analysis were modifiable (e.g., characteristics of geographical
regions), the magnitude of a correlation depended on the unit chosen.
Accordingly, correlations "measure the relationship between the variates
for the specified units chosen for the work" (Yule and Kendall, 1950,
p. 312). Furthermore, they concluded that whenever units are grouped
and correlations are calculated from summary characteristics of the
groups, such as averages, the correlations increase with the size of the
grouping. Conversely, coefficients decrease as the grouping becomes
finer. As we shall see, this generalization is now known to be
incorrect.

In addition to their citation of the Gehlke and Biehl example,
Yule and Kendall correlated the yields of wheat and potatoes from 48
agricultural counties in England in 1936 and successively halved the
number of units by combining contiguous areas (forming 24, 12, 6, and 3
units). These groupings yielded correlations of .219, .296, .576, .765,
and .990, respectively.

Sociologists and political scientists dominated the literature
dealing with grouping for most of the next twenty years. The early
sociological investigations typically focused upon bivariate relations
between qualitative variables where the observations were grouped by
location (e.g., state), by social organization (e.g., school), or by
temporal occurrence (e.g., quarterly statistics). Investigators were
generally concerned about the consequences of using such data to make
inferences about the ungrouped observations. These analysts' problems
were amplified by their lack of control over the grouping process.

The article by Robinson (1950) on the "ecological fallacy"
triggered "one of the liveliest methodological debates in the postwar
period" (Scheuch, 1966, p. 148). Alker (1969) described the surprise,
dismay, and rage of users of ecological data that Robinson caused with
his demonstration that statistical associations for aggregated
populations can differ in magnitude and even in sign from those for
individuals. Robinson advised a distinction between "individual
correlations", which he defined as correlations between indivisible
objects, and "ecological correlations", where the statistical objects
are defined as groups of persons. He warned against treating ecological
correlations as if they were individual correlations. Robinson
considered it to be an "ecological fallacy" to use data grouped by
territorial units as if they were measurements on individuals.

The avowed purpose of Robinson's paper was to provide a mathematical
formulation of the exact relation between ecological and individual
correlations and to show how that relation reflected upon the practice
of using ecological correlations as substitutes for individual
correlations. His analyses of race vs. illiteracy and nativity vs.
illiteracy (see Table 2.1 below) are illustrative.

Robinson's explanation can be summarized as follows:

  i) The individual correlation depends upon the internal
     (within-cell) frequencies of the within-areas contingency
     tables, while the ecological correlation depends upon the
     marginal frequencies of the within-areas contingency
     tables.

 ii) Since the within-group marginal frequencies from which the
     ecological correlation is computed do not fix the internal
     frequencies, which determine the individual correlation,
     there need not be any correspondence between the individual
     and ecological correlations.

According to Robinson, the mathematical relation between individual
and ecological correlations can be written as

[2.1]    $r_E = k_1 r - k_2 r_W$ ,

where

         $k_1 = \dfrac{1}{\eta_X \eta_Y}$

and

         $k_2 = \dfrac{\sqrt{1 - \eta_X^2}\,\sqrt{1 - \eta_Y^2}}{\eta_X \eta_Y}$ .

In these equations, r is the correlation between X and Y for
all N persons; $r_E$ is the "ecological" correlation, the weighted
correlation between m pairs of X and Y percentages which describe
the subgroups in a fourfold table; and $r_W$ is the average of the m
within-group correlations between X and Y, each within-group
correlation being weighted by group size. Also, $\eta_X$ and $\eta_Y$ are the
correlation ratios (the ratio of the between-group variation to the
total variation), which measure the degree to which values of X and Y
cluster by group.

In Robinson's opinion, ecological correlations were used simply because
measures on individuals were not available. Others, beginning with
Menzel (1950), pointed out that relations among the properties of
collectives can have their own inherent value. Questions regarding
appropriate units of analysis remain outside the domain of this
investigation. We are only interested in inferences about the
relations at the level of individuals when the analysis is performed
on grouped data.

Table 2.1. Correlations of illiteracy with race and illiteracy with
nativity at different levels of aggregation.*

                              Value of r                Value of r
Description of Units          (illiteracy and race)     (illiteracy and nativity)

97,272,000 persons                  .203                      .118
48 states                           .773                     -.526
9 geographic regions                .946                     -.619

*The correlations are Pearsonian fourfold correlations based on data
from the 1930 U.S. Census. The three attributes are all dichotomous
(literate vs. illiterate; Negro vs. Non-Negro; Native-born vs.
Foreign-born).

From equation [2.1], Robinson was able to deduce that the individual
and ecological correlations are equal only when

[2.2]    $r_W = k_3 r$ ,

where

         $k_3 = \dfrac{1 - \eta_X \eta_Y}{\sqrt{1 - \eta_X^2}\,\sqrt{1 - \eta_Y^2}}$ .

However, since the minimum value of $k_3$ is unity,† the individual and
ecological correlations can be equal only if the average within-group
correlation is at least as large as the individual correlation. This is
counter to experience; hence there is no reason to expect equivalence
of the ecological and individual correlations.

†In the unlikely case that either correlation ratio equals 1, the value
of $k_3$ is undefined. Otherwise, for any two numbers a and b that are
each less than 1 in absolute value,

         $1 \le \dfrac{1 - ab}{\sqrt{(1 - a^2)(1 - b^2)}}$

         $1 - a^2 - b^2 + a^2 b^2 \le 1 - 2ab + a^2 b^2$   (multiplying by the
                                                    denominator and squaring
                                                    both sides)

         $0 \le a^2 + b^2 - 2ab$

         $0 \le (a - b)^2$

and thus, since we can let $a = \eta_X$ and $b = \eta_Y$, the minimum value
of $k_3$ is 1.

Equation [2.1] also suggested to Robinson how the ecological
correlation depended upon the number of subgroups. He pointed out the
following effects of consolidating units:

  i) The ecological correlation decreases as the groups become
     more heterogeneous, since $r_W$ increases directly with
     increasing group size and the between-group proportion
     of the variation equals $1 - r_W$.

 ii) The correlation ratios $\eta_X^2$ and $\eta_Y^2$ decrease as the
     between-groups variation becomes smaller.

iii) Of the two effects, the changes in the correlation ratios
     are considerably more important than the changes in $r_W$,
     so that the numerical value of the ecological correlation
     increases with increasing consolidation of units.
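Robinson's relation, written as $r_E = k_1 r - k_2 r_W$ with $k_1 = 1/(\eta_X \eta_Y)$ and $k_2 = \sqrt{(1-\eta_X^2)(1-\eta_Y^2)}/(\eta_X \eta_Y)$, can be checked numerically. The sketch below uses simulated scores (the group sizes and effect sizes are arbitrary assumptions) and takes $r_W$ to be the pooled within-area correlation, for which the identity holds exactly; a size-weighted average of separate within-area correlations would agree only approximately.

```python
import math
import random

def partition(groups, i, j):
    """Total, between-group and within-group sums of cross-products
    for coordinates i and j of the (x, y) pairs held in `groups`."""
    rows = [p for g in groups for p in g]
    n = len(rows)
    gi = sum(p[i] for p in rows) / n
    gj = sum(p[j] for p in rows) / n
    total = sum((p[i] - gi) * (p[j] - gj) for p in rows)
    between = 0.0
    for g in groups:
        mi = sum(p[i] for p in g) / len(g)
        mj = sum(p[j] for p in g) / len(g)
        between += len(g) * (mi - gi) * (mj - gj)
    return total, between, total - between

rng = random.Random(0)
groups = []
for size in (20, 35, 50, 15, 40, 30):        # six areas of unequal size
    shift = rng.gauss(0, 1)                    # area effect shared by X and Y
    groups.append([(shift + rng.gauss(0, 1), 0.5 * shift + rng.gauss(0, 1))
                   for _ in range(size)])

sxx_t, sxx_b, sxx_w = partition(groups, 0, 0)
syy_t, syy_b, syy_w = partition(groups, 1, 1)
sxy_t, sxy_b, sxy_w = partition(groups, 0, 1)

r = sxy_t / math.sqrt(sxx_t * syy_t)          # individual correlation
r_e = sxy_b / math.sqrt(sxx_b * syy_b)        # ecological, weighted by size
r_w = sxy_w / math.sqrt(sxx_w * syy_w)        # pooled within-area correlation
eta_x = math.sqrt(sxx_b / sxx_t)
eta_y = math.sqrt(syy_b / syy_t)

k1 = 1 / (eta_x * eta_y)
k2 = math.sqrt((1 - eta_x ** 2) * (1 - eta_y ** 2)) / (eta_x * eta_y)
k3 = (1 - eta_x * eta_y) / math.sqrt((1 - eta_x ** 2) * (1 - eta_y ** 2))

print("r_E computed directly:", round(r_e, 6))
print("k1*r - k2*r_W:        ", round(k1 * r - k2 * r_w, 6))
print("k3 (never below 1):   ", round(k3, 6))
```

The two printed values of the ecological correlation agree to rounding error, and $k_3$ is at least 1, as the footnoted argument requires.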

After Robinson, the emphasis in studies of the effects of grouping
shifted to a search for conditions under which the bias from grouping
can be minimized. Duncan and Davis (1953) developed an estimate of the
size of the error when aggregated data are used to predict
individual-level relations. They examined successive subdivisions of a
territorial unit (in their example, census tracts) and used the
differences in the ecological correlations that were obtained for the
units of varying size as the best estimate of the size of the
ecological fallacy. They concluded that "although different systems of
territorial subdivision give different results, ... the criterion for
choice among these results is clear. The individual correlation is
approximated most closely by the least maximum and the greatest minimum
amongst the results from several systems of territorial subdivision"
(Duncan and Davis, 1953, p. 666).
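The idea of a "least maximum" and "greatest minimum" can be made concrete with the method of bounds for a fourfold table. The sketch below is our own minimal rendering of that idea, not a reproduction of Duncan and Davis's computations: it assumes that only the marginal counts of each area are known, and the counts themselves are hypothetical.

```python
def joint_count_bounds(areas):
    """Bounds on the number of persons having both attributes.

    Each area is (n, n_x, n_y): its population, the number with
    attribute X and the number with attribute Y. Only these marginal
    counts are assumed to be known.
    """
    lower = sum(max(0, n_x + n_y - n) for n, n_x, n_y in areas)
    upper = sum(min(n_x, n_y) for n, n_x, n_y in areas)
    return lower, upper

# Hypothetical counts: one territory, and the same territory split into
# two tracts whose marginal counts add up to the territory's.
systems = {
    "territory": [(1000, 300, 200)],
    "tracts": [(400, 250, 180), (600, 50, 20)],
}
bounds = {name: joint_count_bounds(areas) for name, areas in systems.items()}

# Greatest minimum and least maximum over the systems of subdivision.
best = (max(lo for lo, _ in bounds.values()),
        min(hi for _, hi in bounds.values()))

print(bounds)
print("tightest bounds:", best)
```

Here the undivided territory allows anywhere from 0 to 200 persons with both attributes, while the two-tract system raises the minimum to 30; the finer subdivision therefore narrows the range within which the individual-level association must lie.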

Goodman (1953; 1959) proposed the use of ecological regression
coefficients, rather than ecological correlations, in any attempt to
define the circumstances that reduce the problems Robinson had
identified. Goodman's form of ecological regression is appropriate for
variables which are measured nominally or ordinally, and his method,
though requiring some difficult assumptions, is more efficient than the
Duncan-Davis method of setting bounds.

Briefly, his method is as follows. Let Y be the proportion of the
total population who are illiterate, X be the proportion of the total
population who are Negroes, p be the proportion of Negroes who are also
illiterate, and q be the proportion of Whites who are also illiterate.
Finally, let the groups represent samples from the population of X and
Y values. Then, if (a) population parameters p and q do not differ
from area to area and (b) E(Y) = Xp + (1 - X)q (where X is as
defined above and E(Y) is the expected proportion of illiterate people
in an area), the standard least-squares approach yields unbiased
estimates of p and q and thereby of the slope of the regression of
Y on X. Furthermore, if the values of Y are approximately normally
distributed with the same variance for each value of X, all standard
regression methods also apply.

Thus, according to Goodman (1959, p. 614), the only assumption
necessary to justify his estimation procedures is that p and q "must
be more or less constant for the different ecological areas in such a
way that the standard linear regression model can be applied". His
estimates of the individual-level parameters in the Robinson and
Duncan-Davis examples were a vast improvement over those from ecological
correlations or the Duncan-Davis bounds.
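Goodman's condition (b), E(Y) = Xp + (1 - X)q, can be rewritten as E(Y) = q + (p - q)X, so the intercept of the area-level regression estimates q and the intercept plus the slope estimates p. The sketch below illustrates this with simulated areas; the values of p and q, the number of areas, and the size of the disturbance are all assumptions made for the example.

```python
import random

def least_squares(xs, ys):
    """Intercept and slope of the simple regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical parameters, held constant over areas (condition a).
p_true, q_true = 0.40, 0.05

rng = random.Random(1)
# X: proportion of each area's population in the first category.
xs = [0.05 + 0.55 * k / 29 for k in range(30)]
# Y: proportion illiterate, E(Y) = X p + (1 - X) q, plus a small bounded
# disturbance standing in for area-to-area sampling fluctuation.
ys = [x * p_true + (1 - x) * q_true + rng.uniform(-0.01, 0.01) for x in xs]

intercept, slope = least_squares(xs, ys)
q_hat = intercept            # expected Y in an area with X = 0
p_hat = intercept + slope    # expected Y in an area with X = 1
print("q_hat:", round(q_hat, 3), " p_hat:", round(p_hat, 3))
```

When p and q do vary systematically with X, the same computation returns biased values, which is why constancy over areas is the critical assumption.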

Blalock's examination of "change in the units of analysis" problems
was the first break from the consideration of exclusively nominal and
ordinal variables. Blalock (1964) used a causal framework to examine
empirically the effects of grouping strategies on the correlation
coefficient r and the regression coefficients $b_{YX}$ and $b_{XY}$. He
placed artificial restrictions on the grouping criterion in order to
alter the variation among X and Y variables in specific ways: to
maximize variation in X, to maximize variation in Y, and to minimize
the effects of grouping on both variables (random grouping). Fourthly,
areal units were grouped by proximity.

Blalock demonstrated that r remained unchanged only under
random grouping. When Y was the dependent variable, both random
grouping and maximizing variation in X left the estimate of $b_{YX}$
unchanged, but the variance of the slope estimate increased. However,
$b_{XY}$ was affected by maximizing variation in X. Thus, if one is to
infer individual-level relationships from aggregated data, individuals
have to be grouped in such a way that their scores on the dependent
variable are related to group membership only indirectly, through their
scores on the independent variable.
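Blalock's comparison of grouping strategies can be imitated with simulated scores. The sketch below is an illustration under assumed values (a true slope of 0.5, 2000 persons, 20 equal-sized groups), not a reanalysis of Blalock's data; groups are formed by sorting on X, by sorting on Y, or at random, and the slope of Y on X is then computed from the group means.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def slope_from_group_means(pairs, m):
    """Slope computed from the means of m equal-sized groups of
    consecutive (x, y) pairs; equal sizes make weighting unnecessary."""
    size = len(pairs) // m
    groups = [pairs[i * size:(i + 1) * size] for i in range(m)]
    return slope([sum(p[0] for p in g) / size for g in groups],
                 [sum(p[1] for p in g) / size for g in groups])

rng = random.Random(2)
people = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    people.append((x, 0.5 * x + rng.gauss(0, 1)))   # assumed true slope 0.5

shuffled = people[:]
rng.shuffle(shuffled)

slopes = {
    "ungrouped": slope([p[0] for p in people], [p[1] for p in people]),
    "grouped on X": slope_from_group_means(sorted(people, key=lambda p: p[0]), 20),
    "grouped on Y": slope_from_group_means(sorted(people, key=lambda p: p[1]), 20),
    "random groups": slope_from_group_means(shuffled, 20),
}
for name, value in slopes.items():
    print(f"{name:14s} {value:7.3f}")
```

Grouping on X leaves the slope close to the ungrouped value, grouping on the dependent variable inflates it severely, and random grouping gives a value that is unbiased in expectation but unstable from one random assignment to the next.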

II. Econometric Perspectives on Grouping: "Optimal Grouping"

Econometricians have traditionally followed an entirely separate
line of inquiry. The problems they have attempted to solve are those
caused by an overabundance of data. They consider the practical
problems facing an investigator who can choose among a variety of
grouping methods. Prais and Aitchinson (1954) and Cramer (1964) have
done basic work to be recounted in detail later. Here, we provide only
a short summary.

Within a general regression model, Prais and Aitchinson (1954) set
out to estimate the regression parameters $\beta_1, \ldots, \beta_K$ for K
regressors, and the variances of the estimators from the individual and
grouped observations. Following classical least-squares procedures,
they showed that, whatever the method of grouping, (a) the resulting
estimators are always unbiased, (b) the variances of the estimators
based on grouped data are always greater than those of the estimators
from the original observations, and (c) the efficiency of grouped
estimators is optimized by maximizing the between-groups variation in
the regressors.

For most of the 1950's and 60's, the Prais-Aitchinson results
defined the state of econometric knowledge on the topic. Cramer
(1964), following the Prais-Aitchinson approach, focused on strategies
for optimal grouping in the two-variable case without seriously
considering the possibility of bias. He evaluated certain efficient
grouping procedures under conditions common to economic survey analysis
and provided empirical examples of optimal grouping from the literature
on economics.

Haitovsky (1966; 1973) did not follow the path laid out by
Prais-Aitchinson and Cramer. Instead, he studied alternative ways of
estimating multiple-regression coefficients when the data are in the
form of one-way classification tables for which the cell frequencies of
the cross-classifications are not available. His most important
contribution is his empirical evidence that grouping on one regressor
can lead to biased estimators when the hypothesized model contains
multiple regressors.

Recent work by Feige and Watts (1972) is even more definitive in
the multiple-regressor case. They considered the analytical consequences
of "partial aggregation" as a means of performing individual-level
analysis while preserving the confidentiality of data. Perhaps this new
substantive focus explains how they found differences between estimators
of regression coefficients based on the individual and grouped data, a
result contrary to the findings of Prais and Aitchinson but in accordance
with Blalock's findings. They attributed the differences to one of three
sources: (i) specification bias (omission of regressors),
(ii) bias introduced by a grouping transformation that is not independent
of the disturbances, or (iii) sampling error introduced by the use of
less information in the grouped regression. They also provided new
criteria for judging the bias and efficiency of grouping methods. We
shall explore their work and Haitovsky's in more detail in Chapter 6.

Hannan (1970a, 1971, 1972) integrated the various approaches to the
aggregation problems discussed herein. His extension of Blalock's causal
logic is particularly pertinent to future application of this technique
to the problems of grouping. The concluding remarks of Hannan's book
on aggregation (1971, pp. 116-117) identified the areas where the
knowledge of grouping effects was limited. He called for expanding
our understanding of the consequences of estimating individual-level
relations from grouped data, the problem of the present inquiry.

CHAPTER 3

ESTIMATION OF THE LINEAR REGRESSION COEFFICIENT
FROM
GROUPED DATA IN THE SINGLE-REGRESSOR CASE

Chapter 3 focuses on the substantive factors that determine the
effects of using grouped data to estimate the relations that exist in
data on individuals. For the time being, we consider a linear model
with a single regressor X, leaving multivariate problems to Chapter 5.

As a point of departure, the methods employed by Prais and
Aitchinson (1954) and by Cramer (1964) for examining regression
coefficients from grouped data are presented. These methods represent the
general econometric approach to the effects of grouping of observations
prior to recent work by Haitovsky (1966) and by Feige and Watts (1972).
(See Chapter 5 for further discussion of their work.) Potential problems
with the earlier econometric approach are cited. The approach of the
sociologists Blalock (1964) and Hannan (1970; 1971) is discussed as an
alternative to the econometric conceptualization of grouping effects.

The remainder of the chapter is devoted to attempts to develop a
mathematical formulation that will account for the grouping effects
described by Blalock and Hannan. The concept of a "grouping variable"
is introduced to emphasize the relations of the chosen grouping
characteristic to the variables of interest. The simple linear model is
replaced by a structure which incorporates an interval grouping variable
Z. A taxonomy is then generated by considering possible linear
relations of Z to the regressor X and of Z to the regressand Y after
adjusting for the relation of Z to X. Four categories result when
Z is placed prior to X and Y in the model.

The bias and, where appropriate, the relative efficiency of
estimating the regression coefficient ($\beta_{YX}$) at the individual level
from the group means are examined for each taxonomic category. The
results indicate that grouping can yield either a biased or an unbiased
estimator. The model which incorporates the grouping variable is found
to be better suited for treating the problems of data aggregation than
the analytical methods of Prais-Aitchinson and Cramer. In particular,
the altered model leads to an explicit formulation of the expected bias
due to grouping by a variable having specified relations to the X and Y
variables.

I. Terminology and Notation

Three types of variables are considered: dependent, independent,
and grouping.^1 A dependent variable, or regressand, is an "outcome" or
an "effect" in educational investigations. Only the case of a single
dependent variable (Y) will be treated.

The independent variables, or regressors, are those the investigator
studies as "causes", "determiners", or "predictors" of the variation in
the dependent variable. Where there are multiple independent variables,
$X = [X^{(1)}, \ldots, X^{(k)}]$ denotes the k-dimensional vector
representation for the complete set, and $X^{(q)}$ refers to any one
variable, q = 1, ..., k. When there is only one independent variable,
the superscript will be dropped.

Typically, values of the independent variable are assumed to have
been established prior to those of the dependent variable. For example,
parents' income is logically prior to the educational achievement of
their children. Income could be an independent variable since it is not
an "outcome" but rather a potential "cause" of student achievement. The
model specified for the relations among variables takes this "order"
into account and differentiates between causes and outcomes at each
step in the chain.

^1 Other writers make no formal use of a "grouping variable". Some speak
informally of "the method of grouping" [see, e.g., Prais and Aitchinson
(1954) and Cramer (1964)].

There is a grouping characteristic. In practice, a function of the
grouping characteristic, which we shall label as Z or $Z_{(m)}$, assigns
the original observations to cardinally numbered groups. The model to
be developed in this chapter requires that the values of Z be
represented on an interval scale. In Chapter 4 we shall discuss how the
model can be used to understand the bias introduced by grouping on a
variable that is merely nominal or ordinal.^2

If a grouping variable is formed from the student characteristic
"number of years of mathematics" by use of the rubrics "0-1", "2-3",
"4 or more", Z = 2 when a student has had more than 1 year of mathematics
and less than 4. More formally, when Z is interval, an individual
belongs to group i if his value on the grouping characteristic is
greater than $U_{i-1}$, the upper bound in the range for group i-1, and
less than $L_{i+1}$, the lower bound of the range for group i+1.^3
Moreover, the value of Z associated with members of group i will be
the mean of the group for the grouping characteristic.

^2 Ordinal grouping variables can be treated in the same manner as nominal
variables. Alternatively, a non-linear transformation can be performed
on the categories of the ordinal grouping variable so that it can be
treated as interval. In this case each non-linear transformation yields
a different grouping variable with different relations to the other
study variables.

^3 It is also possible to generate an interval grouping variable from an
unordered grouping characteristic by appropriate scaling procedures
(e.g., scaling of father's occupation). This option will be discussed
in Chapter 4.
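The construction of a grouping variable from a grouping characteristic can be sketched in a few lines. The example below uses the "years of mathematics" rubrics; the eight student records are hypothetical, and the function name `group_index` is ours.

```python
def group_index(years):
    """Map the rubrics '0-1', '2-3', '4 or more' to groups 1, 2, 3."""
    if years <= 1:
        return 1
    if years <= 3:
        return 2
    return 3

# Hypothetical years of mathematics for eight students.
years_of_math = [0, 1, 1, 2, 3, 3, 4, 6]
index = [group_index(y) for y in years_of_math]

# Interval-scaled Z: every member of a group receives the group's mean
# on the grouping characteristic.
members = {}
for i, y in zip(index, years_of_math):
    members.setdefault(i, []).append(y)
z_value = {i: sum(v) / len(v) for i, v in members.items()}
z = [z_value[i] for i in index]

print(index)
print(z)
```

A different set of rubrics applied to the same characteristic would yield a different Z, with possibly different relations to X and Y.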

Grouping characteristics can (unless binary) be recoded in alternative
ways to form different grouping variables Z. When necessary, $Z_{(m)}$
is used to emphasize that a particular grouping variable forms m groups,
perhaps in contrast to an alternative $Z_{(m')}$. A $Z_{(m)}$ may also
be contrasted with a $Z'_{(m)}$ that divides the scale on the grouping
characteristic differently.

A grouping variable can be generated from an independent variable
or a dependent variable, or in some other way. In a study of the relation
of parental income (X) to educational achievement (Y), the grouping
characteristic could be X and the grouping variable something like
"decile rank, in the population, of parents' income". Here, the
independent variable and grouping variable are both functions of parental
income though their numerical forms differ. X may have been given in
the form of actual dollars of income or in terms of income percentiles.
The grouping variable Z has the possible values 1, ..., 10, so ten
groups can be formed.

Often, the grouping variable is distinct from both X and Y.
For example, observations can be grouped on "father's education",
"student's sex", or for that matter, "third letter in student's last
name". Indeed, the values of Z can be numbers assigned to persons at
random, in which case Z is unrelated to X and Y.

Our models will specify relations among X, Y, and Z, rather
than among X, Y, and the grouping characteristic. This is done
because observations are actually grouped on a particular Z and two
grouping variables generated from the same grouping characteristic can
have different relations to X and Y.

A. The Structure Among the Variables

The relations of interest are the structural relations of Y to
X. The regression equations represent the presumed underlying
structure among the variables. For a given X and Y, there are three
possible structural models for the relation between X and Y: (a) X
determines Y, (b) Y determines X, (c) there is a reciprocal
relation. (There may be other determiners of X and Y, denoted by
u. When necessary, a subscript is attached to u to identify the
variable influenced.) The trivial case of no relation can be ignored.

This investigation concentrates on model (a), which can be
represented by the path diagram

                  $\beta_{YX}$
        X -----------------> Y <-------- $u_Y$

The arrows in the diagram indicate the direction of influence; in this
case, X determines Y. In a linear model, $\beta_{YX}$ is the coefficient
from the regression of Y on X. $u_Y$ represents all the determiners
of Y that are linearly independent of X. Hence $u_Y$ includes errors of
measurement in Y, the effects on Y of variables other than X, and
residuals due to any lack of fit of the linear model. Effects such as
$u_Y$ are known as "disturbances" or "disturbance terms". A disturbance
$u_X$ could be added prior to X, but this disturbance does not affect
the X-Y relation of this model.

The relation depicted in model (a) can be identified by the
"structural equation" $Y = \alpha + \beta_{YX} X + u_Y$. This equation
specifies that Y can be partitioned into a constant part, a common part
due to its linear relation to X, and a residual part, independent of X.
The independent variable X is not partitioned. In factor-analytic
terms, two factors can be chosen to account for X and Y: a factor
defined by X and a factor for the residual part of Y (that is to
say, for Y·X).

B. Notation

We begin with N persons, p = 1, ..., N. These can be divided
among m "groups" on the basis of their Z values. Throughout most
of this investigation, the "p" is recoded as ij for clearer
designation of group memberships. With the ij notation, group i
contains $n_i$ persons, $n_1 + \cdots + n_i + \cdots + n_m = N$. The labels
$X_{ij}$, $Y_{ij}$, $Z_{ij}$ identify the scores of the jth member in the ith
group (i = 1, ..., m; j = 1, ..., $n_i$).

Following a standard convention, $\bar{X}_{..}$, $\bar{Y}_{..}$, and $\bar{Z}_{..}$ represent
grand means and $\bar{X}_{i.}$, $\bar{Y}_{i.}$, and $\bar{Z}_{i.}$ the means for group i. Under
the assumptions made in this investigation, $Z_{ij} = \bar{Z}_{i.}$. The
disturbances $u_{ij}$ have group means $\bar{u}_{i.}$. (Later there will be other
disturbance terms v and w, to which the same conventions apply.)

Throughout the analyses, population variances and covariances are
denoted by $\sigma_X^2$, $\sigma_{XY}$, and so on. Also of interest are population
correlation coefficients (e.g., $\rho_{XZ}$) and $\beta_{XZ}$, the coefficient
describing the regression of X on Z. The partial regression
coefficients $\beta_{YX \cdot Z}$ and $\beta_{YZ \cdot X}$ are important later. In the notation
for partials, the effects of the variable placed after the dot have
been controlled when considering the relation of Y to the other
regressor.

Additional notation is needed when the sample of persons is only
a subset of the population. For the total sample, a sum of squares or
sum of cross-products [deviated from the appropriate mean(s)] is
identified by $SS_T(\ )$. For example, $SS_T(X)$ denotes the total sum
of squared deviations of $X_{ij}$ from the grand mean:

         $SS_T(X) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})^2$ .

Similarly, $SS_B(X)$ represents the between-group sum of squares for X:

         $SS_B(X) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (\bar{X}_{i.} - \bar{X}_{..})^2
                  = \sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2$ .

We shall use $SS_W(X)$ to denote a within-group sum of squares:

         $SS_W(X) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2$ .

The sum of cross-products of X and Y will be denoted by S(X,Y)
[$S_B(X,Y)$ for the between-group sum of cross-products]. V( ) and C( , )
denote the sample variances and covariances: the sum of squares and
sum of cross-products divided by N-1, respectively. The sample
values of correlation coefficients are represented by $r_{XZ}$ and
so on.
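The three sums of squares are linked by the identity $SS_T(X) = SS_B(X) + SS_W(X)$. A minimal sketch with hypothetical scores in three groups of unequal size shows the partition; the function name is ours.

```python
def sums_of_squares(groups):
    """Total, between-group and within-group sums of squares for a
    list of groups, each a list of scores."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    ss_t = sum((x - grand) ** 2 for x in values)
    ss_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return ss_t, ss_b, ss_w

groups = [[1, 2, 3], [4, 6], [10]]    # hypothetical X scores, m = 3
ss_t, ss_b, ss_w = sums_of_squares(groups)
print(round(ss_t, 4), round(ss_b, 4), round(ss_w, 4))
```

The same partition applies to sums of cross-products, with the product of X and Y deviations replacing the squared deviation.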

C. Assumptions About Sampling

In the single-regressor case, our analytical work is based on two
sets of assumptions about the sampling of observations. In the simpler
case, we take our sample of N persons to be the population of interest.
The investigator can then determine $\beta_{YX}$ from the ungrouped
observations.

When observations are grouped on the basis of some Z, the
regression analysis performed on the group means ($\bar{X}_{i.}$, $\bar{Y}_{i.}$)
(weighted by group size $n_i$) generates the coefficient $\beta_{\bar{Y}\bar{X}}$ from
the population of group means. In this case, where the sample equals
the population, our central questions have to do with the adequacy of
$\beta_{\bar{Y}\bar{X}}$ as a substitute for $\beta_{YX}$; i.e., what is the value of
$\beta_{\bar{Y}\bar{X}} - \beta_{YX}$?

Alternatively, we may assume that the persons are a random sample
from the population with the constraints that the groups are an
exhaustive sample of the values of Z, and the sizes of the groups in
the sample are directly proportional to the sizes of the groups in the
population. These conditions amount to the implicit assumption that
we draw a proportionate stratified sample with strata defined by the
values of Z. However, we treat the observations as if they are a
simple random sample from the population.

Under the latter sampling assumption (sample ≠ population), the
estimator of $\beta_{YX}$ based on the ungrouped observations is denoted by
$b_{YX}$, and its variance over a hypothetical population of independent
random samples is denoted by $V(b_{YX})$. The sample estimator of $\beta_{\bar{Y}\bar{X}}$
from the weighted group means is denoted by $B_{\bar{Y}\bar{X}}$, and $V(B_{\bar{Y}\bar{X}})$
represents its variance over samples.

The bias from using grouped data in this case is reflected in
the difference between the expected value of $B_{\bar{Y}\bar{X}}$ and $\beta_{YX}$
[$E(B_{\bar{Y}\bar{X}}) - \beta_{YX}$, where expectation is over all i and j]. The
difference between $B_{\bar{Y}\bar{X}}$ and $b_{YX}$ provides an estimate of the bias
due to grouping.

The relative efficiency of $b_{YX}$ and $B_{\bar{Y}\bar{X}}$ as estimators of $\beta_{YX}$
is determined by comparing $V(b_{YX})$ to $MSE(B_{\bar{Y}\bar{X}})$, where
$MSE(B_{\bar{Y}\bar{X}}) = V(B_{\bar{Y}\bar{X}}) + [E(B_{\bar{Y}\bar{X}}) - \beta_{YX}]^2$. When $b_{YX}$ and
$B_{\bar{Y}\bar{X}}$ are unbiased estimators of $\beta_{YX}$ and $\beta_{\bar{Y}\bar{X}}$, respectively,
$MSE(B_{\bar{Y}\bar{X}}) = V(B_{\bar{Y}\bar{X}}) + (\beta_{\bar{Y}\bar{X}} - \beta_{YX})^2$ gives the mean
squared error from estimating $\beta_{YX}$ from $B_{\bar{Y}\bar{X}}$.

Table 3.1 summarizes the alternative sampling procedures and the
measures of precision in estimating $\beta_{YX}$ from grouped observations
under each procedure.
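The decomposition of the mean squared error into variance and squared bias can be checked by simulation. The sketch below is an illustration under assumed values (true slope 0.5, samples of 200 persons, 10 groups formed by sorting on Y, 200 replications); grouping on the dependent variable is chosen only because it produces a large and easily visible bias.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def slope_from_group_means(pairs, m):
    """Slope from the means of m equal-sized groups of consecutive pairs."""
    size = len(pairs) // m
    groups = [pairs[i * size:(i + 1) * size] for i in range(m)]
    return slope([sum(p[0] for p in g) / size for g in groups],
                 [sum(p[1] for p in g) / size for g in groups])

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Variance over replications (divisor = number of replications)."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / len(values)

BETA = 0.5                      # assumed individual-level slope
rng = random.Random(3)
b_values, B_values = [], []
for _ in range(200):            # 200 independent samples
    sample = []
    for _ in range(200):        # 200 persons per sample
        x = rng.gauss(0, 1)
        sample.append((x, BETA * x + rng.gauss(0, 1)))
    b_values.append(slope([p[0] for p in sample], [p[1] for p in sample]))
    by_y = sorted(sample, key=lambda p: p[1])       # groups formed on Y
    B_values.append(slope_from_group_means(by_y, 10))

bias_B = mean(B_values) - BETA
var_B = variance(B_values)
mse_B = mean([(B - BETA) ** 2 for B in B_values])
var_b = variance(b_values)

print("V(b):        ", round(var_b, 4))
print("V(B) + bias^2:", round(var_B + bias_B ** 2, 4))
print("MSE(B):      ", round(mse_B, 4))
```

In this constructed case nearly all of the mean squared error of the grouped estimator is squared bias, so comparing variances alone would badly understate the loss from grouping.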

When the data represent a subsample of the population, sampling
bias can contribute to the discrepancy between parameter estimates from
grouped and ungrouped data. Thus, we potentially confound grouping bias
with sampling bias. Treating a proportionate stratified sample as if
it were a simple random sample also offers hazards for interpretation.
Later, when we talk about bias due to grouping, we do not make the
distinction between sampling bias and grouping bias. The combined
quantity is attributed to what we call grouping bias or discrepancy.^4

The assumption of exhaustive sampling of the values of Z causes
no special problems when Z is an interval grouping variable based on
an interval grouping characteristic. However, whenever the
characteristic is nominal, such as school or classroom, the generality
of conclusions is restricted by requiring exhaustive sampling. The
investigator would like to generalize beyond the classrooms he samples.
In any case, the classrooms sampled should at least be randomly
representative of some population of interest, and lack of
representativeness introduces additional bias. This source of bias is
also attributed to grouping under the prescribed analytical procedures.



^4 Feige and Watts (1972) add specification bias as a third confounding
source for the difference between grouped and ungrouped coefficients.
In fact, Feige (personal communication) believes that what we call
grouping bias is actually specification bias arising from the
omission of a relevant variable from the initial model. We do not
disagree with this interpretation. However, the generality of the
notion of specification bias fails to capture the fact that an
investigator may be interested in estimating the simple linear
regression coefficient at the individual level, and his problem
arises mainly because he must analyze aggregated data.

Table 3.1. Indices of precision of estimates from grouped data as a
function of sampling procedure.

                                                            Measure of Precision
Nature of Sample      Description of Sampling Procedure     Discrepancy                         Efficiency

Sample = Population   The sample of N persons               $\beta_{\bar{Y}\bar{X}} - \beta_{YX}$             (entry not legible
                      constitutes the population.                                               in source)

Sample ≠ Population   N persons are sampled randomly        $d = B_{\bar{Y}\bar{X}} - b_{YX}$ ;           $MSE(B_{\bar{Y}\bar{X}})$ relative
                      from the population in such a way     E(discrepancy) = Bias               to $V(b_{YX})$
                      that all possible values for Z are
                      sampled in proportion to the sizes
                      of the groups in the population.

II. Regression Coefficients to be Contrasted

Between-groups regression coefficients can always be estimated from
grouped data. For example, assume that in an investigation of the
relation between achievement and income, students are grouped on the
basis of fathers' education. In this situation, the averages of parental
income and student achievement at successive levels of fathers' education
and the group sizes $n_i$ become the data for the regression analysis.
The investigator can then calculate $B_{\bar{Y}\bar{X}}$, the slope of the regression
of group means of achievement on means for income. This is an unbiased
estimate of $\beta_{\bar{Y}\bar{X}}$.

However, the purpose of the investigation is to learn about the
ungrouped regression coefficient $\beta_{YX}$. Our question then is "what is
the relation of $B_{\bar{Y}\bar{X}}$ to $\beta_{YX}$?" That is, we want to know the conditions
under which the slope estimator ($B_{\bar{Y}\bar{X}}$) from the between-groups
regression is an unbiased (or possibly just consistent) and efficient
estimator of the slope ($\beta_{YX}$) from the regression of Y on X using the
ungrouped observations. The rest of this inquiry moves toward a
statement of these conditions.
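When only group means and group sizes are available, the between-groups slope is the ratio of the weighted between-group sum of cross-products to the weighted between-group sum of squares. The sketch below computes it for four hypothetical levels of fathers' education; the means and sizes are invented for illustration.

```python
def between_groups_slope(sizes, x_means, y_means):
    """Slope of group means of Y on group means of X, each group
    weighted by its size n_i."""
    n = sum(sizes)
    grand_x = sum(w * x for w, x in zip(sizes, x_means)) / n
    grand_y = sum(w * y for w, y in zip(sizes, y_means)) / n
    sc_b = sum(w * (x - grand_x) * (y - grand_y)
               for w, x, y in zip(sizes, x_means, y_means))
    ss_b = sum(w * (x - grand_x) ** 2 for w, x in zip(sizes, x_means))
    return sc_b / ss_b

# Hypothetical summary data for four levels of fathers' education.
sizes = [10, 20, 30, 40]                 # n_i
income_means = [20, 30, 40, 50]          # group means of X
achievement_means = [45, 50, 52, 60]     # group means of Y

B = between_groups_slope(sizes, income_means, achievement_means)
print("between-groups slope:", B)
```

Whether this value is also a good estimate of the individual-level slope depends on how the grouping variable is related to X and Y, which is the question taken up in the remainder of the chapter.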

III. The Bivariate Case: Standard Model

We first present a standard statistical model for the relation of
Y to X in the ungrouped observations and in the group means. A
discussion of the formulation by Cramer (1964) follows,^5 with
digressions to call attention to important problems of application.
Section III.E., in particular, is devoted to the effects of violating
assumptions on the Prais-Aitchinson and Cramer conclusions. Finally,
we discuss work by Blalock (1964) and Hannan (1970, 1971, 1972) which
delineates the effects of grouping in a more realistic manner than the
Prais-Aitchinson and Cramer treatments. Throughout this section we
deal with the case of a subsample from the population.

^5 We concentrate here on Cramer's bivariate regression analysis rather
than the multiple-regression work done by Prais and Aitchinson. The
latter will be discussed in more detail in Chapter 5.

A. Regression Analysis of the Ungrouped Observations

When a sample of N persons, p = 1, ..., N, is drawn from the
population, the relation between $Y_p$ and $X_p$ is described by the
regression equation

[3.1]    $Y_p = \alpha + \beta_{YX} X_p + u_p$

where 

One set of assumptions for this model (with random X) is

A1. The $X_p$ are random variables distributed independently of the
    $u_p$.

A2. The $u_p$ are independent random disturbances with $E(u_p) = 0$
    and $V(u_p) = \sigma_u^2$ for all p.

In this case the least-squares estimator of $\beta_{YX}$ from the sample
of individual data is given by

[3.3]    $b_{YX} = \frac{C(X_p, Y_p)}{V(X_p)} = \frac{\sum_{p=1}^{N} (X_p - \bar{X})(Y_p - \bar{Y}) / N}{\sum_{p=1}^{N} (X_p - \bar{X})^2 / N}$

When [3.2] is substituted for $Y_p$ in [3.3] and the expectation
taken, we obtain (by summation over persons)

[3.4]    $E(b_{YX}) = \beta_{YX} + E\left[\frac{C(X_p, u_p)}{V(X_p)}\right] = \beta_{YX} + E\left[\frac{\sum_{p} (X_p - \bar{X})(u_p - \bar{u})}{\sum_{p} (X_p - \bar{X})^2}\right]$

Since the disturbances $u_p$ and the regressor $X_p$ are assumed to be
independent by A1, the second term is zero. So

$E(b_{YX}) = \beta_{YX}$

and $b_{YX}$ is an unbiased estimator of $\beta_{YX}$. (When the $u_p$ are
normally distributed, $b_{YX}$ is also the maximum likelihood estimator.)
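
As a concrete illustration of the ungrouped estimator in [3.3], the
following minimal Python sketch computes the slope as the ratio of the
sample covariance to the sample variance. The data values and the
function name `ols_slope` are hypothetical and serve only to show the
arithmetic.

```python
def ols_slope(xs, ys):
    """Least-squares slope b_YX = C(X, Y) / V(X) from individual data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    # The common divisor n cancels; it is kept to mirror C(X, Y) / V(X).
    return cov_xy / var_x


# Hypothetical individual observations (X_p, Y_p), p = 1, ..., 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
b_yx = ols_slope(xs, ys)
print(round(b_yx, 4))
```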

Under assumptions A1 and A2, the variance of $b_{YX}$ can be shown
to be (see, e.g., Goldberger, 1964, p. 267)

[3.5]    $V(b_{YX}) = E\left[\frac{C(X_p, u_p)}{V(X_p)}\right]^2 = \sigma_u^2 \, E\left[\frac{1}{\sum_{p=1}^{N} (X_p - \bar{X})^2}\right]$



If the data satisfy the assumptions on the $X_p$ and $u_p$ and the
sampling assumptions, then within the class of linear unbiased
estimators of the linear regression coefficient of Y on X, $b_{YX}$ is
the estimator with minimum variance (see, e.g., Goldberger, 1964,
p. 269).

B. Regression Estimation from Data on Groups

Double subscripts are needed for the sample observations when
$\beta_{YX}$ is estimated from data on groups. The "p" are recoded as
"ij" (see Section II.B.). Equation [3.1] becomes

[3.1]    $Y_{ij} = \alpha + \beta_{YX} X_{ij} + u_{ij}$

That is, ij (group i, jth member) replaces p.

We can retain the definitions of $b_{YX}$ and $V(b_{YX})$ given by [3.3]
and [3.5], as no change in assumptions has been made about data at the
individual level. Note, in particular, that we have assumed sampling
of individuals as i,j units, and not sampling of i and of j within i.

In estimating the regression coefficient from group means, any
ordering of the groups is ignored. The within-group means, weighted by
the number of observations in the group ($n_i$), replace the
$(X_{ij}, Y_{ij})$ pairs, and the regression equation relating the
$\bar{Y}_{i.}$ to the $\bar{X}_{i.}$ is estimated. We shall hereafter
refer to $\beta_{\bar{Y}\bar{X}}$ as the population value of the
least-squares coefficient predicting $\bar{Y}_{i.}$ from $\bar{X}_{i.}$,
where the means are weighted in proportion to group size in the
population. $\bar{\alpha}$ will denote the intercept in this equation.

The relation between $\bar{Y}_{i.}$ and $\bar{X}_{i.}$ is described by the
regression equation

[3.6]    $\bar{Y}_{i.} = \bar{\alpha} + \beta_{\bar{Y}\bar{X}} \bar{X}_{i.} + \bar{u}_{i.}$



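This equation has the same form as [3.1] where now the group means play
the role of "individuals". If the assumptions about the $u_{ij}$ hold for
the ungrouped observations, the analogous statements also hold for the
grouped observations. (E.g., $E(\bar{u}_{i.}) = 0$.)

We define

$C(\bar{X}_{i.}, \bar{Y}_{i.}) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (\bar{X}_{i.} - \bar{X}_{..})(\bar{Y}_{i.} - \bar{Y}_{..})}{N-1} = \frac{\sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})(\bar{Y}_{i.} - \bar{Y}_{..})}{N-1}$

and

$V(\bar{X}_{i.}) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (\bar{X}_{i.} - \bar{X}_{..})^2}{N-1} = \frac{\sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2}{N-1}$

where group means have been weighted by their corresponding $n_i$. The
weighted least-squares estimator, $B_{\bar{Y}\bar{X}}$, of
$\beta_{\bar{Y}\bar{X}}$ is then

[3.7]    $B_{\bar{Y}\bar{X}} = \frac{C(\bar{X}_{i.}, \bar{Y}_{i.})}{V(\bar{X}_{i.})}$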



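When [3.6] is substituted for $\bar{Y}_{i.}$ in [3.7], and the expectation
taken, we obtain

[3.8]    $E(B_{\bar{Y}\bar{X}}) = \beta_{\bar{Y}\bar{X}} + E\left[\frac{C(\bar{X}_{i.}, \bar{u}_{i.})}{V(\bar{X}_{i.})}\right]$

Since $\bar{u}_{i.}$ and $\bar{X}_{i.}$ are independently distributed, the
second term is zero, and $B_{\bar{Y}\bar{X}}$ is an unbiased estimator of
$\beta_{\bar{Y}\bar{X}}$.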



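Under the assumptions A1 and A2, the variance of the grouped
estimator is

[3.9]    $V(B_{\bar{Y}\bar{X}}) = \sigma_u^2 \, E\left[\frac{1}{\sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2}\right]$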

C. Bias and Efficiency of Estimating $\beta_{YX}$ from Grouped
Observations

Though $B_{\bar{Y}\bar{X}}$ is an unbiased estimator of
$\beta_{\bar{Y}\bar{X}}$, we are interested in its adequacy as an
estimator of $\beta_{YX}$, the coefficient from the ungrouped
observations. If we let $d = B_{\bar{Y}\bar{X}} - \beta_{YX}$ represent
the discrepancy in estimating $\beta_{YX}$ from $B_{\bar{Y}\bar{X}}$, then
the bias from grouping, $\theta$, can be written

[3.10]    $\theta = E(d) = E(B_{\bar{Y}\bar{X}} - \beta_{YX}) = \beta_{\bar{Y}\bar{X}} - \beta_{YX}$

Since $E(b_{YX}) = \beta_{YX}$, we may also write

[3.11]    $\theta = E(d) = E(B_{\bar{Y}\bar{X}} - b_{YX})$

According to [3.10], the bias in estimating $\beta_{YX}$ from
$B_{\bar{Y}\bar{X}}$ is zero when the population value of the regression
coefficient from grouped data equals the population value of the
coefficient from the ungrouped observations. Furthermore, by [3.11],
the bias can be evaluated by comparing the grouped estimator
$B_{\bar{Y}\bar{X}}$ with the ungrouped estimator $b_{YX}$.

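We also want to evaluate the efficiency of estimator $b_{YX}$ relative
to estimator $B_{\bar{Y}\bar{X}}$ in estimating the regression coefficient
from ungrouped data. For the time being, we shall take as our index of
the efficiency of the grouped estimator the ratio of the mean-squared
error of $b_{YX}$ to the mean-squared error of $B_{\bar{Y}\bar{X}}$ in
estimating $\beta_{YX}$; namely,

[3.12]    $\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}}) = \frac{MSE(b_{YX})}{MSE(B_{\bar{Y}\bar{X}})} = \frac{V(b_{YX})}{V(B_{\bar{Y}\bar{X}}) + \theta^2}$

since $b_{YX}$ and $B_{\bar{Y}\bar{X}}$ are unbiased estimators of
$\beta_{YX}$ and $\beta_{\bar{Y}\bar{X}}$, respectively.

When $\beta_{\bar{Y}\bar{X}} = \beta_{YX}$, the efficiency index [3.12]
can be written as a ratio of expectations involving the between-group
and total sums of squares of X by substitution from [3.5] and [3.9]:

[3.13]    $\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}}) = \frac{V(b_{YX})}{V(B_{\bar{Y}\bar{X}})} = \frac{\sigma_u^2 \, E[1/SS_T(X)]}{\sigma_u^2 \, E[1/SS_B(X)]}$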

From the theorems on the components of variance, the total sum of
squares over all N observations can be decomposed in the following
manner:

$SS_T(X) = SS_B(X) + SS_W(X)$
 (Total)   (Between)  (Within)

so that

$SS_B(X) \le SS_T(X)$

Because all terms are non-negative,

$\frac{1}{SS_T(X)} \le \frac{1}{SS_B(X)}$

and

$E\left[\frac{1}{SS_T(X)}\right] \le E\left[\frac{1}{SS_B(X)}\right]$

Consequently, $\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}}) \le 1$, and
$B_{\bar{Y}\bar{X}}$ is generally less efficient than $b_{YX}$.

Furthermore, according to [3.13], a grouping procedure that maximizes
the between-group sum of squares of the independent variable leads
to more efficient estimates. That is, one prefers a procedure which
forms groups homogeneous in $X_{ij}$. So, of those grouping procedures
that yield unbiased estimators of $\beta_{YX}$, that which maximizes
(minimizes) the between-group (within-group) sum of squares of the
independent variable yields the best estimates.
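
The decomposition of the total sum of squares, and its bearing on
efficiency, can be checked with a small numerical sketch. The Python
below is illustrative only: the six X values, the two groupings, and the
function name `sums_of_squares` are hypothetical. It treats the X values
as fixed, so the ratio $SS_B(X)/SS_T(X)$ is used as the efficiency index.

```python
def sums_of_squares(groups):
    """Return (SS_total, SS_between, SS_within) for grouped X values.

    groups: list of lists, one inner list of X values per group.
    """
    all_x = [x for g in groups for x in g]
    grand = sum(all_x) / len(all_x)
    ss_total = sum((x - grand) ** 2 for x in all_x)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return ss_total, ss_between, ss_within


# The same six hypothetical X values, grouped in two different ways.
homogeneous = [[1.0, 2.0], [5.0, 6.0], [9.0, 10.0]]   # groups alike in X
mixed = [[1.0, 10.0], [2.0, 9.0], [5.0, 6.0]]         # groups unlike in X

sst_h, ssb_h, ssw_h = sums_of_squares(homogeneous)
sst_m, ssb_m, ssw_m = sums_of_squares(mixed)

eff_homogeneous = ssb_h / sst_h   # close to 1: little efficiency lost
eff_mixed = ssb_m / sst_m         # zero: group means of X do not vary
print(eff_homogeneous, eff_mixed)
```

Groups that are homogeneous in X keep almost all of the variation of X
between groups, while the mixed grouping leaves none, which is the
preference stated above.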

D. Differences from Cramer's Formulation

The analytical work of Cramer (1964) differs in two respects from
what has been done so far. First, Cramer assumes that the $X_{ij}$ are
fixed and given, making the additive disturbance the only random element.
Under the assumption of fixed $X_{ij}$, the sums of squares involving
$X_{ij}$ are constants and the expressions for the variances of the
estimators can be simplified. That is, when the $X_{ij}$ are fixed and
given, [3.5] and [3.9] can be written as

$V(b_{YX}) = \frac{\sigma_u^2}{SS_T(X)}$

and

$V(B_{\bar{Y}\bar{X}}) = \frac{\sigma_u^2}{SS_B(X)}$

respectively.

Moreover, when $\beta_{\bar{Y}\bar{X}} = \beta_{YX}$, the efficiency of
the grouped estimator becomes

$\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}}) = \frac{SS_B(X)}{SS_T(X)} = \eta_X^2$

This $\eta_X^2$ is the correlation ratio.

Here, again, we see that grouped estimators that maximize the
between-group variation in the $X_{ij}$, i.e., that maximize $\eta_X^2$,
yield the most efficient estimators. Thus, conclusions about the
efficiency of estimation are not affected by whether the $X_{ij}$ are
assumed to be fixed or random.

The other major difference in the Cramer formulation involves his
assumptions regarding the sampling of observations and the effects of
grouping on the population parameters to be estimated. According to
Cramer, the sample of N observations $(X_{ij}, Y_{ij})$ is "from the
outset divided into m groups of $n_i$ observations each. The $X_{ij}$ are
fixed and given, and the $Y_{ij}$ are repeated samples defined by

[3.1]    $Y_{ij} = \alpha + \beta_{YX} X_{ij} + u_{ij}$    [his equation (1)]

where $\alpha$ and $\beta_{YX}$ are unknown constants" (Cramer, 1964,
p. 235, emphasis added). His assumptions about the disturbances $u_{ij}$
are equivalent to assumptions A1 and A2 above.

Cramer further states that it follows from his equation (1) that

$\bar{Y}_{i.} = \alpha + \beta_{YX} \bar{X}_{i.} + \bar{u}_{i.}$

That is, he assumes that the act of averaging observations within groups
does not alter the model assumed to be generating the observations and
thus does not affect the parameters that are to be estimated. Thus,
from his analysis, we would conclude that $b_{YX}$ and
$B_{\bar{Y}\bar{X}}$, as given by our [3.3] and [3.7], respectively, are
both unbiased estimates of $\beta_{YX}$. (This is also the conclusion
reached by Cramer.)

In Section III.B., we state that the equation relating $\bar{Y}_{i.}$ to
the $\bar{X}_{i.}$ is

[3.6]    $\bar{Y}_{i.} = \bar{\alpha} + \beta_{\bar{Y}\bar{X}} \bar{X}_{i.} + \bar{u}_{i.}$

where parameters $\bar{\alpha}$ and $\beta_{\bar{Y}\bar{X}}$ may differ
from the parameters $\alpha$ and $\beta_{YX}$ for the ungrouped
observations. This is an important distinction that foreshadows our
differing conclusions regarding possible bias from grouping. In the
next section we consider how Cramer's assumptions caused him to overlook
several plausible grouping procedures that can result in biased
estimation.

E. Implications of Assumptions for Equation [3.1]

Not all methods of grouping meet the conditions implied by
assumptions A1 and A2; neither Cramer nor Prais-Aitchinson notes
this explicitly. For example, if the data of students from the school
districts of California are used to estimate the regression of student
achievement on parental income, it is plausible that the mean
disturbance will vary according to school district. This would mean that
$E(u_{ij}) = \mu_{i.}$, not necessarily zero or constant. But unless
$E(u_{ij} - \bar{u}_{..}) = 0$, we are unable to simplify equations [3.4]
and [3.8] when the $X_{ij}$ are random variables. That is, if the
$u_{ij}$ have a non-zero expectation, $b_{YX}$ and $B_{\bar{Y}\bar{X}}$
are biased estimators of their respective parameters.

Heteroscedasticity and interdependence among the disturbances are
other plausible complications. Assumption A2 no longer holds. Under
these conditions, the disturbances can be described instead by the
equation

$\mathrm{Cov}(u_{ij}, u_{i'j'}) = \sigma_u^2 \, \Omega$

where $\Omega$ is an N x N covariance matrix whose off-diagonal cells
need not be zero. The elements on the diagonal (variances) may vary
according to group (district) and the covariance within a group can be
non-zero; that is, $E(u_{ij} u_{ij'}) \ne 0$.

When heteroscedasticity and interdependence of disturbances are
present, least-squares estimators are still unbiased, but they no longer
have minimum mean-squared error (cf. Goldberger, 1964, pp. 231-243).
This problem can be overcome by transforming the observations so that
they satisfy A2 and estimating the parameters from the transformed
data. For example, when heteroscedasticity is strictly a function of
differences in group size [that is, when
$\Omega = \mathrm{diag}(1/n_1, \ldots, 1/n_m)$], weighted least-squares
procedures using the grouped data perform the necessary adjustments.
With more serious complications, as when $\Omega$ is unknown,
econometricians generally place restrictions on $\Omega$ to permit its
estimation from the simple regression model.

The violation of assumption A1 through covariation of regressor
with disturbance has serious consequences for least-squares estimation
from grouped data. Covariation between the $X_{ij}$ and the $u_{ij}$ can
occur when the regression model is misspecified through the omission of
a variable related to both X and Y. It must then operate through
the disturbance term. That is, though the regression coefficient
$\beta_{YX}$ from [3.1] is to be estimated, a better specification of the
processes at work is

$Y_{ij} = \alpha + \beta_{YX \cdot W} X_{ij} + \beta_{YW \cdot X} W_{ij} + u'_{ij}$

where $W_{ij}$ is the variable "omitted" from [3.1]. Given the above
specification, the least-squares estimator of $\beta_{YX}$ from the
single-regressor model has expectation

$E(b_{YX}) = \beta_{YX \cdot W} + \beta_{YW \cdot X} \, b_{WX}$

where $b_{WX}$ is the sample regression coefficient of W on X (cf.
Theil, 1957).
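
The specification-error result has an exact in-sample counterpart: the
single-regressor slope of Y on X equals the two-regressor coefficient on
X plus the coefficient on W times the slope of W on X. The following
minimal Python sketch verifies this algebraic identity on made-up data.
The data values and the helper name `dev_products` are hypothetical.

```python
def dev_products(a, b):
    """Sum of cross-products of deviations from the means of a and b."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))


# Hypothetical data: X is the included regressor, W the omitted variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9]

sxx, sww, sxw = dev_products(x, x), dev_products(w, w), dev_products(x, w)
sxy, swy = dev_products(x, y), dev_products(w, y)

# Two-regressor least-squares coefficients of Y on X and W.
det = sxx * sww - sxw ** 2
b_yx_w = (sxy * sww - swy * sxw) / det
b_yw_x = (swy * sxx - sxy * sxw) / det

b_wx = sxw / sxx   # slope of W on X
b_yx = sxy / sxx   # slope of Y on X with W omitted

print(b_yx, b_yx_w + b_yw_x * b_wx)
```

The two printed values agree up to rounding, so the single-regressor
slope absorbs the effect of W in proportion to how strongly W is related
to X.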

The misspecification becomes a problem when $\beta_{YX}$ is estimated
from observations grouped on the omitted variable. By grouping on W
(which is at least partially masked by the $u_{ij}$ in [3.1]), the
assumption of independence of regressor and disturbance is violated at
the grouped level since the $\bar{W}_{i.}$ are related to both the
$\bar{X}_{i.}$ and the $\bar{u}_{i.}$. As a result,
$C(\bar{X}_{i.}, \bar{u}_{i.}) \ne 0$ and $B_{\bar{Y}\bar{X}}$ from [3.7]
is then a biased estimator of $\beta_{YX}$.

Finally, in the present example, the designation of a single
constant $\beta_{YX}$ and the assumptions for the model represent an
oversimplification even for the ungrouped observations. Our model does
not consider the possibility that the Y-on-X slopes differ because of
school district effects. If differential district effects are observed,
the analyst might best examine his data in some multivariate way.

F. Grouping on the Dependent Variable: Ideas of Blalock and
Hannan

Before moving to our own approach to estimation from grouped
observations, we point out arguments by Blalock (1964) and Hannan
(1971; 1972) that run counter to Cramer and Prais-Aitchinson. Both
Blalock and Hannan have argued that systematic grouping methods can
yield biased estimators of regression coefficients.

Blalock (1964) based his objection to the "no bias" conclusions of
Prais-Aitchinson and Cramer on the findings by Robinson (1950) and
others that correlation coefficients are biased by grouping, and on the
relation of regression coefficients to the squared correlation
coefficients. His reasoning was as follows:

1. Groupings which maximize variation in either X or Y
   inflate the correlation:

   $r^2_{\bar{X}\bar{Y}} > r^2_{XY}$

2. According to Prais-Aitchinson and Cramer, grouping on X does
   not bias the estimate of $\beta_{YX}$.

3. The squared correlation $r^2_{XY}$ equals the product of the
   regression coefficients $b_{YX}$ and $b_{XY}$:

   $r^2_{XY} = b_{YX} \, b_{XY}$

   Similarly,

   $r^2_{\bar{X}\bar{Y}} = B_{\bar{Y}\bar{X}} \, B_{\bar{X}\bar{Y}}$

4. Given the above, it follows that grouping on X inflates the
   regression coefficient $b_{XY}$:

   $B_{\bar{X}\bar{Y}} > b_{XY}$

Blalock's conclusion from the above was that the regression
coefficient is inflated when data are grouped on the dependent variable.
This apparently contradicts arguments that estimates from grouped
observations are always unbiased.

Building on Blalock, Hannan (1972) provided a particularly apt
description of how bias can arise through grouping. He argued that bias
occurs when observations are grouped on the dependent variable Y. When
variation in Y is maximized by ranking observations by their Y values
and then grouping "adjacent" observations, observations that have both
high X values and high u values will be placed in the highest Y groups,
assuming $\beta_{YX}$ is positive. Similarly, observations with both low
X values and low u values are placed in the groups lowest on Y. Thus,
other determiners of Y are confounded with X so that
$C(\bar{X}, \bar{u})$ can no longer be expected to equal zero. Hannan
stated that this correlation between the regressor variable and the
disturbance violated the assumptions and was the result of a
specification error magnified by grouping. Since the model at the
grouped level is misspecified, the least-squares estimators are no
longer unbiased.

Blalock's and Hannan's arguments are largely intuitive. In the
next section, we present a formal mathematical treatment which supports
the contentions of Blalock and Hannan.
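
The intuition behind grouping on the dependent variable can be
illustrated with a small simulation. The Python sketch below is a
hypothetical illustration, not an analysis from the study: it assumes
normally distributed X and u, a true slope of 0.5, and 50 equal-sized
groups of adjacent observations, and it compares the slope from group
means when observations are ranked on X with the slope when they are
ranked on Y.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def grouped_slope(xs, ys, keys, n_groups):
    """Rank observations on `keys`, form equal-sized groups of adjacent
    observations, and regress group means of Y on group means of X."""
    order = sorted(range(len(xs)), key=lambda i: keys[i])
    size = len(xs) // n_groups
    mean_xs, mean_ys = [], []
    for g in range(n_groups):
        members = order[g * size:(g + 1) * size]
        mean_xs.append(sum(xs[i] for i in members) / size)
        mean_ys.append(sum(ys[i] for i in members) / size)
    # Groups are equal in size, so weighting by n_i changes nothing.
    return slope(mean_xs, mean_ys)


random.seed(0)
N, BETA = 5000, 0.5   # hypothetical sample size and true slope
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
us = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [BETA * x + u for x, u in zip(xs, us)]

b_ungrouped = slope(xs, ys)
b_grouped_on_x = grouped_slope(xs, ys, xs, 50)
b_grouped_on_y = grouped_slope(xs, ys, ys, 50)
print(b_ungrouped, b_grouped_on_x, b_grouped_on_y)
```

Under these assumptions, grouping on X leaves the slope near the
ungrouped value, while grouping on Y gives a much larger slope, because
the groups highest on Y collect observations with large disturbances as
well as large X.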

IV. A Structural Model for Determining the Effects of Grouping 

A systematic procedure is developed for examining the consequences
of different methods of grouping observations. The procedure is an
extension of the "structural equations" approach by Blalock (1964) and
by Hannan (1971; 1972). First an interval grouping variable Z is
added to the model of [3.1]. In other words, the rule by which the
individual observations are assigned to groups is treated as a random
variable which may be related to other variables in the system. If
the grouping variable Z is related to another variable, the structure
will specify that Z is prior to that variable. It does not matter
that Z may appear to be determined by, say, X in the sense that X
would be logically or temporally prior to Z if the three-variable
model Y = f(X,Z) were under investigation. We visualize the grouping
process as one in which Z can "select" or "force" the observations
from the bivariate distribution of X and Y into groups. It is in
this sense that Z is prior to X and Y.*

The equations for the modified structure are presented below for
both grouped and ungrouped cases. In addition, general formulas are
derived for both grouped and ungrouped coefficients, their estimators,
and their variances. Even though we are investigating "a single
regressor", we have here a three-variable system where Y can be
regressed on X and Z.

Next we consider how the relations of the grouping variable to the
other variables affect the usefulness of $B_{\bar{Y}\bar{X}}$ as an
estimator of $\beta_{YX}$. Problems with regard to the scale and
distribution of the variables are set aside for the moment. A taxonomy
will be set out such that grouping variables from the modified structure
fit into one of several mutually exclusive categories defined by the
relations of Z to X and Y. As later sections will demonstrate, the use
of this taxonomic structure enables the investigator to reject many
potential grouping variables by examining their matrix of correlations
and partial correlations with the main variables in the study.

* This interpretation of Z is in no sense arbitrary. The process of
grouping systematically has much in common with the notion of
selection. In fact, Lutjohann (personal communication) has suggested
that the grouping bias we discuss is essentially selection bias, the
result of a manipulated sampling of the observations of X and Y
because of their association with Z. Recent work by Goldberger
(1972) on selection bias in evaluating treatment effects with
non-random sampling also hints at the connection.

A. Structure with Z prior to X and Y

The path diagram for the structure when Z is prior to X and Y is

[Path diagram not reproduced: Z -> X, Z -> Y, X -> Y, with disturbance
v -> X and disturbance w -> Y.]

In this diagram, v is the disturbance term representing all
determiners of X that are not linearly related to Z, and w is the
disturbance term representing all determiners of Y that are not linearly
related to X or Z. $\beta_{YX \cdot Z}$, $\beta_{YZ \cdot X}$, and
$\beta_{XZ}$ are path regression coefficients.

The equations corresponding to the structure with Z incorporated
can be written

[3.14a]    $Y = \alpha + \beta_{YX \cdot Z} X + \beta_{YZ \cdot X} Z + w$

[3.14b]    $X = \lambda + \beta_{XZ} Z + v$

We recall that $\beta_{YX \cdot Z}$, $\beta_{YZ \cdot X}$, and
$\beta_{XZ}$ refer to regression parameters in a system with several
variables. Even though we include the grouping variable Z, this is an
equation at the individual level: every person has a $Z_p$. w and v are
disturbance terms with zero expected values for all persons; w is
assumed to be independent of X, Z, and v; and v is assumed to be
independent of Z. We further assume that both disturbance terms are
homoscedastic (i.e., for any two persons,
$\sigma^2_{w_1} = \sigma^2_{w_2} = \sigma^2_w$ and
$\sigma^2_{v_1} = \sigma^2_{v_2} = \sigma^2_v$) and independent. (This
implies that for any two persons, $\sigma_{w_1 w_2} = 0$ and
$\sigma_{v_1 v_2} = 0$.)

Although we again write $\alpha$ for the intercept term in [3.14a], its
value may differ from that in earlier equations. We let $\lambda$
represent the intercept term in the second equation of the structural
system.

Equation [3.14b] can be substituted into [3.14a] to obtain a
single equation for the regression of Y on Z and v:

[3.15]    $Y = (\alpha + \beta_{YX \cdot Z} \lambda) + (\beta_{YX \cdot Z} \beta_{XZ} + \beta_{YZ \cdot X}) Z + \beta_{YX \cdot Z} v + w$

Equation [3.15] is actually a reparameterization of [3.1] where X has
been divided into two parts: the part predictable from the grouping
variable Z and a residual part v. Equations like [3.15] are
generally called "reduced-form" equations. This means that [3.15] is
in a form that cannot be reduced further by substitution of other
equations from the structural system. Later on, we use reduced-form
expressions to simplify our analytical work.

In Table 3.2, expressions for the population variances and
covariances of the variables in equations [3.14a] and [3.14b] are
provided. The corresponding reduced-form versions are enclosed in
brackets.

The regression coefficient relating Y to X (the ratio of
$\sigma_{XY}$ to $\sigma_X^2$ as given in Table 3.2) is equivalent to the
coefficient given by [3.1]. As can be seen, that ratio involves the
three regression coefficients ($\beta_{YX \cdot Z}$,
$\beta_{YZ \cdot X}$, $\beta_{XZ}$) and the variances $\sigma_Z^2$,
$\sigma_v^2$, and $\sigma_w^2$.

By substitution (of the reduced-form expressions from Table 3.2),
we get

[3.16]    $\beta_{YX} = \frac{\sigma_{XY}}{\sigma_X^2} = \beta_{YX \cdot Z} + \beta_{YZ \cdot X} \beta_{XZ} \frac{\sigma_Z^2}{\beta_{XZ}^2 \sigma_Z^2 + \sigma_v^2}$
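
Equation [3.16] expresses the ungrouped coefficient in terms of the
structural parameters of [3.14a] and [3.14b]. As a minimal sketch, the
Python function below evaluates that expression. The function name
`ungrouped_slope` and the parameter values are hypothetical choices made
only to show how the terms combine.

```python
def ungrouped_slope(b_yx_z, b_yz_x, b_xz, var_z, var_v):
    """Population slope of Y on X implied by the structure with Z.

    b_yx_z: coefficient of X in the equation for Y (Z held constant)
    b_yz_x: coefficient of Z in the equation for Y (X held constant)
    b_xz:   coefficient of Z in the equation for X
    var_z, var_v: variances of Z and of the disturbance v
    """
    var_x = b_xz ** 2 * var_z + var_v          # variance of X
    return b_yx_z + b_yz_x * b_xz * var_z / var_x


# Hypothetical parameter values.
beta_yx = ungrouped_slope(0.5, 0.3, 1.0, 1.0, 1.0)
print(round(beta_yx, 6))

# If Z has no direct relation to Y given X, the slope reduces to b_yx_z.
beta_no_direct_z = ungrouped_slope(0.5, 0.0, 1.0, 1.0, 1.0)
```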



B. Revised Structure for Weighted Group Means

The structural equations for the means of groups with uniform Z
can be written as

[3.17a]    $\bar{Y} = \alpha + \beta_{YX \cdot Z} \bar{X} + \beta_{YZ \cdot X} \bar{Z} + \bar{w}$

[3.17b]    $\bar{X} = \lambda + \beta_{XZ} \bar{Z} + \bar{v}$

These equations are the same as [3.14a] and [3.14b] except that grouped
quantities have been substituted for their ungrouped counterparts. In
addition to the intercepts, there are still six parameters:
$\beta_{YX \cdot Z}$, $\beta_{YZ \cdot X}$, $\beta_{XZ}$,
$\sigma^2_{\bar{Z}}$, $\sigma^2_{\bar{v}}$, and $\sigma^2_{\bar{w}}$. Note
that we specify the same regression parameters as in [3.14], since
averaging observations within groups does not alter the model underlying
the generation of observations. (This is analogous to Cramer's
assumption discussed in Section III.D, though now we operate with a more
correctly specified model.)

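Table 3.3 contains the population values for the variances and
covariances of the variables in equations [3.17a] and [3.17b]. The
reduced forms are again enclosed in brackets.

Table 3.3. Covariance matrix for variables in equations [3.17a] and
[3.17b] (reduced forms in brackets)

Variables                Variance or covariance

$\bar{Y}, \bar{Y}$       $\beta_{YX \cdot Z}^2 \sigma^2_{\bar{X}} + \beta_{YZ \cdot X}^2 \sigma^2_{\bar{Z}} + 2 \beta_{YX \cdot Z} \beta_{YZ \cdot X} \sigma_{\bar{X}\bar{Z}} + \sigma^2_{\bar{w}}$
                         $[(\beta_{YX \cdot Z} \beta_{XZ} + \beta_{YZ \cdot X})^2 \sigma^2_{\bar{Z}} + \beta_{YX \cdot Z}^2 \sigma^2_{\bar{v}} + \sigma^2_{\bar{w}}]$

$\bar{X}, \bar{Y}$       $\beta_{YX \cdot Z} \sigma^2_{\bar{X}} + \beta_{YZ \cdot X} \sigma_{\bar{X}\bar{Z}}$
                         $[(\beta_{YX \cdot Z} \beta_{XZ} + \beta_{YZ \cdot X}) \beta_{XZ} \sigma^2_{\bar{Z}} + \beta_{YX \cdot Z} \sigma^2_{\bar{v}}]$

$\bar{X}, \bar{X}$       $\beta_{XZ}^2 \sigma^2_{\bar{Z}} + \sigma^2_{\bar{v}}$

$\bar{Z}, \bar{Y}$       $\beta_{YZ \cdot X} \sigma^2_{\bar{Z}} + \beta_{YX \cdot Z} \sigma_{\bar{X}\bar{Z}}$
                         $[(\beta_{YX \cdot Z} \beta_{XZ} + \beta_{YZ \cdot X}) \sigma^2_{\bar{Z}}]$

$\bar{Z}, \bar{X}$       $\sigma_{\bar{X}\bar{Z}}$   $[\beta_{XZ} \sigma^2_{\bar{Z}}]$

$\bar{Z}, \bar{Z}$       $\sigma^2_{\bar{Z}}$

$\bar{w}, \bar{Y}$       $\sigma^2_{\bar{w}}$

$\bar{w}, \bar{X}$       0

$\bar{w}, \bar{Z}$       0

$\bar{w}, \bar{w}$       $\sigma^2_{\bar{w}}$

$\bar{v}, \bar{Y}$       $\beta_{YX \cdot Z} \sigma^2_{\bar{v}}$

$\bar{v}, \bar{X}$       $\sigma^2_{\bar{v}}$

$\bar{v}, \bar{Z}$       0

$\bar{v}, \bar{w}$       0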
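$\bar{v}, \bar{v}$       $\sigma^2_{\bar{v}}$

By substitution of the reduced-form expressions from Table 3.3,
the regression coefficient relating $\bar{Y}$ to $\bar{X}$ (the ratio of
$\sigma_{\bar{X}\bar{Y}}$ to $\sigma^2_{\bar{X}}$) can be written as

[3.18]    $\beta_{\bar{Y}\bar{X}} = \frac{\sigma_{\bar{X}\bar{Y}}}{\sigma^2_{\bar{X}}} = \beta_{YX \cdot Z} + \beta_{YZ \cdot X} \beta_{XZ} \frac{\sigma^2_{\bar{Z}}}{\beta_{XZ}^2 \sigma^2_{\bar{Z}} + \sigma^2_{\bar{v}}}$

Comparing [3.16] and [3.18], we see that $\beta_{YX}$ and
$\beta_{\bar{Y}\bar{X}}$ differ in that between-group variances replace
total variances. When our sample constitutes the entire population (the
first case in Table 3.1), the discrepancy, or bias, $\theta$, can be
found by substituting from [3.16] and [3.18] for the appropriate terms
in [3.10]:

[3.19]    $\theta = \beta_{\bar{Y}\bar{X}} - \beta_{YX} = \beta_{YZ \cdot X} \beta_{XZ} \left( \frac{\sigma^2_{\bar{Z}}}{\sigma^2_{\bar{X}}} - \frac{\sigma^2_Z}{\sigma^2_X} \right)$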
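
The bias expression in [3.19] can be evaluated directly once the
structural coefficients and the total and between-group variances are
specified. The Python sketch below is illustrative only: the function
name `grouping_bias` and all parameter values are hypothetical, and the
between-group variance of X is built from the between-group variances of
Z and v in the same way that the total variance of X is built from the
total variances.

```python
def grouping_bias(b_yz_x, b_xz, var_z, var_v, var_z_between, var_v_between):
    """Bias of the grouped coefficient relative to the ungrouped one.

    var_z, var_v:                 total variances of Z and v
    var_z_between, var_v_between: variances of the group means of Z and v
    """
    var_x = b_xz ** 2 * var_z + var_v
    var_x_between = b_xz ** 2 * var_z_between + var_v_between
    return b_yz_x * b_xz * (var_z_between / var_x_between - var_z / var_x)


# Hypothetical case: groups are uniform in Z, so the between-group
# variance of Z equals its total variance, while averaging within groups
# shrinks the variance of v.
theta = grouping_bias(0.3, 1.0, 1.0, 1.0, 1.0, 0.1)
print(round(theta, 6))

# No bias arises when Z is unrelated to X or to Y given X.
theta_no_xz = grouping_bias(0.3, 0.0, 1.0, 1.0, 1.0, 0.1)
theta_no_yz = grouping_bias(0.0, 1.0, 1.0, 1.0, 1.0, 0.1)
```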

C. Estimator of $\beta_{YX}$ from Individual Data

Under the modified structure, a simple random sample of
$N \, (= \sum_{i=1}^{m} n_i)$ observations is drawn from the trivariate
distribution $f(X_{ij}, Y_{ij}, Z_{ij})$ generated by [3.14a] and
[3.14b]. The sample regression estimator of $\beta_{YX}$ is given by

[3.20]    $b_{YX} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})(Y_{ij} - \bar{Y}_{..})}{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})^2} = \frac{\sum xy}{\sum x^2}$

where $x = X_{ij} - \bar{X}_{..}$ and $y = Y_{ij} - \bar{Y}_{..}$ are
deviation scores and summation is over all N persons.

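Equation [3.20] is a double-scripted version of equation [3.3].
An expression for the expected value of $b_{YX}$ in terms of parameters
of the modified structure is found by substituting [3.14a] for $Y_{ij}$
in [3.20] (all variables in deviation form) and taking the expectation:

[3.21]    $E(b_{YX}) = E\left[\frac{\sum xy}{\sum x^2}\right] = E\left[\frac{\sum x (\beta_{YX \cdot Z} x + \beta_{YZ \cdot X} z + w)}{\sum x^2}\right] = \beta_{YX \cdot Z} + \beta_{YZ \cdot X} E\left[\frac{\sum xz}{\sum x^2}\right]$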

Equation [3.21] is in a form that cannot be simplified without
additional assumptions since, by [3.14b], x and z may be related.
We can, however, examine the asymptotic properties of the expression
under the conditions that both $E(xz)$ and $E(x^2)$ exist and $E(x^2)$ is
non-zero. By the Strong Law of Large Numbers*

$\mathrm{plim} \left[\frac{\sum xz}{\sum x^2}\right] = \mathrm{plim} \left[\frac{\sum xz / N}{\sum x^2 / N}\right] = \frac{\mathrm{plim} \left[\sum xz / N\right]}{\mathrm{plim} \left[\sum x^2 / N\right]}$

where plim denotes the probability limit (the limit as $N \to \infty$) of
the enclosed expressions. The right-hand side can be further simplified
since

$\mathrm{plim} \left[\sum x^2 / N\right] = \sigma_X^2$

and

$\mathrm{plim} \left[\sum xz / N\right] = \sigma_{XZ} = \beta_{XZ} \sigma_Z^2$

since v and z are independent. Therefore,

[3.22]    $\mathrm{plim}(b_{YX}) = \beta_{YX \cdot Z} + \beta_{YZ \cdot X} \beta_{XZ} \frac{\sigma_Z^2}{\sigma_X^2}$

where, as expected, the right-hand side of [3.22] is the same as the
right-hand side of [3.16].

* I am indebted to Professor Julius Blum for pointing out that the
Strong Law of Large Numbers is useful in this situation.
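
The agreement between the large-sample value of $b_{YX}$ and the
expression in [3.16] can be checked by simulation. The Python sketch
below is a hypothetical illustration: it assumes normally distributed Z,
v, and w with unit variances and made-up structural coefficients,
generates data from [3.14a] and [3.14b] with zero intercepts, and
compares the sample slope with the limiting value.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(1)
N = 20000
B_YX_Z, B_YZ_X, B_XZ = 0.5, 0.3, 1.0   # hypothetical structural coefficients
VAR_Z, VAR_V = 1.0, 1.0                # variances used in the simulation

zs = [random.gauss(0.0, 1.0) for _ in range(N)]
xs = [B_XZ * z + random.gauss(0.0, 1.0) for z in zs]            # [3.14b]
ys = [B_YX_Z * x + B_YZ_X * z + random.gauss(0.0, 1.0)          # [3.14a]
      for x, z in zip(xs, zs)]

b_yx = slope(xs, ys)
limit = B_YX_Z + B_YZ_X * B_XZ * VAR_Z / (B_XZ ** 2 * VAR_Z + VAR_V)
print(b_yx, limit)
```

With a sample of this size the estimated slope should lie close to the
limiting value, although any single simulated sample will differ from it
by sampling error.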

The variance of $b_{YX}$ under the modified structure can be written
as

[3.23]    $V(b_{YX}) = E\left[b_{YX} - E(b_{YX})\right]^2$

Substituting [3.20] and [3.21] in [3.23],

$V(b_{YX}) = E\left[\frac{\sum xy}{\sum x^2} - \beta_{YX \cdot Z} - \beta_{YZ \cdot X} E\left(\frac{\sum xz}{\sum x^2}\right)\right]^2$

and, after substituting the deviation form of [3.14a] for y,

$V(b_{YX}) = E\left[\beta_{YZ \cdot X} \left(\frac{\sum xz}{\sum x^2} - E\left(\frac{\sum xz}{\sum x^2}\right)\right) + \frac{\sum xw}{\sum x^2}\right]^2$

By expanding the right-hand side and applying the assumptions that w
is independent of x and z and $E(w) = 0$, [3.23] can be further
reduced to

$V(b_{YX}) = \beta_{YZ \cdot X}^2 \, E\left[\frac{\sum xz}{\sum x^2} - E\left(\frac{\sum xz}{\sum x^2}\right)\right]^2 + E\left[\frac{\sum xw}{\sum x^2}\right]^2$

The last term in the above expression is equal to
$\sigma_w^2 \, E\left[1 / \sum x^2\right]$ by the same reasoning we used
to derive $V(b_{YX})$ (Equation [3.5]) in Section III.A. Also, for the
time being, we shall use the fact that $\sum xz / \sum x^2$ is the
expression for the least-squares estimator $b_{ZX}$ to simplify the
equation for the variance:

$V(b_{YX}) = \beta_{YZ \cdot X}^2 \, V(b_{ZX}) + \sigma_w^2 \, E\left[\frac{1}{SS_T(X)}\right]$

It is difficult to simplify [3.23] further because x is a function
of both z and v under the most general conditions. Later, we
examine $V(b_{YX})$ under conditions where Z is assumed to be unrelated
to X, to Y·X, or to both. In these cases, the expression for
the variance of the estimator of $\beta_{YX}$ from ungrouped data can be
simplified.

D. Estimator from Grouped Data

The $Y_{ij}$ and $X_{ij}$ from the sample of N observations drawn from
the trivariate distribution $f(X_{ij}, Y_{ij}, Z_{ij})$ are grouped on the
basis of the values of $Z_{ij}$. Each observation is then replaced by the
group mean corresponding to its $Z_{ij}$ value; that is, $\bar{X}_{i.}$
replaces $X_{ij}$ and $\bar{Y}_{i.}$ replaces $Y_{ij}$. (In this
treatment, $\bar{Z}_{i.} = Z_{ij}$, so that
$\sigma^2_{\bar{Z}} = \sigma^2_Z$.) Furthermore, we assume that the group
sizes in the sample (the $n_i$'s) are proportional to the group sizes in
the population, so that bias has not been introduced through
non-proportionate sampling from groups.



The equation for the sample regression coefficient
$B_{\bar{Y}\bar{X}}$ can be written as:

[3.24]    $B_{\bar{Y}\bar{X}} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \bar{x}_{i.} \bar{y}_{i.}}{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \bar{x}_{i.}^2} = \frac{\sum \bar{x}\bar{y}}{\sum \bar{x}^2}$

where lower-case letters denote deviations of group means from the grand
means of the sample and summation is over all N observations. Though
written in a different manner, equation [3.24] is simply the
double-scripted version of [3.7].

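We now follow the same procedures used in Section IV.C for ungrouped
data in order to find the expected value of the sample estimator from
grouped data under the modified structure. Substituting the deviation
form of [3.14a] for $\bar{y}$ in [3.24] and taking the expectation, we
obtain

[3.25]    $E(B_{\bar{Y}\bar{X}}) = E\left[\frac{\sum \bar{x}\bar{y}}{\sum \bar{x}^2}\right] = \beta_{YX \cdot Z} + \beta_{YZ \cdot X} E\left[\frac{\sum \bar{x}\bar{z}}{\sum \bar{x}^2}\right]$

since x and w (and $\bar{x}$ and $\bar{w}$) are assumed to be independent
and $E(w) = E(\bar{w}) = 0$.

By the same reasoning used to derive [3.22], it can be shown that
asymptotically

[3.26]    $\mathrm{plim}(B_{\bar{Y}\bar{X}}) = \beta_{YX \cdot Z} + \beta_{YZ \cdot X} \beta_{XZ} \frac{\sigma^2_{\bar{Z}}}{\sigma^2_{\bar{X}}}$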

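The right-hand side of [3.26] is the same as the right-hand side of
[3.18].

An expression for the variance of $B_{\bar{Y}\bar{X}}$ under the modified
structure is found in the same fashion as $V(b_{YX})$ in [3.23]. It can
be shown that

[3.27]    $V(B_{\bar{Y}\bar{X}}) = \beta_{YZ \cdot X}^2 \, E\left[\frac{\sum \bar{x}\bar{z}}{\sum \bar{x}^2} - E\left(\frac{\sum \bar{x}\bar{z}}{\sum \bar{x}^2}\right)\right]^2 + \sigma_w^2 \, E\left[\frac{1}{SS_B(X)}\right] = \beta_{YZ \cdot X}^2 \, V(B_{\bar{Z}\bar{X}}) + \sigma_w^2 \, E\left[\frac{1}{SS_B(X)}\right]$

where $B_{\bar{Z}\bar{X}}$ is the least-squares estimator from the
regression of $\bar{Z}$ on $\bar{X}$ over all N persons.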

The only differences between the equations for grouped and ungrouped
coefficients ([3.16] and [3.18]), their sample estimators ([3.21] and
[3.25]; also [3.22] and [3.26] for the asymptotic expressions), and
sample variances ([3.23] and [3.27]) are that sums of squares and
variances of the group means of Z and X replace the sums of squares
and variances of the corresponding ungrouped observations. And, since
$\sigma^2_{\bar{Z}} = \sigma^2_Z$ and $SS_B(Z) = SS_T(Z)$ under the
modified structure, the only substantive changes involve variation of
the independent variable.

$b_{YX}$ and $B_{\bar{Y}\bar{X}}$ have been shown to be asymptotically
unbiased estimators of $\beta_{YX}$ and $\beta_{\bar{Y}\bar{X}}$,
respectively, but the investigator wants to estimate $\beta_{YX}$ from
$\beta_{\bar{Y}\bar{X}}$ (when the sample equals the population) or from
$B_{\bar{Y}\bar{X}}$. In Section V.B we shall identify the conditions
under which $\beta_{\bar{Y}\bar{X}} = \beta_{YX}$ and
$B_{\bar{Y}\bar{X}}$ is an unbiased estimator of $\beta_{YX}$.

E. A Taxonomy for Classifying Grouping Variables

A taxonomy for comparing grouping variables can be formed by setting
various combinations of $\beta_{YZ \cdot X}$ and $\beta_{XZ}$ from
[3.14a] and [3.14b] equal to zero. The categories of the taxonomy
reflect different sets of constraints on the relations of Z to Y and X.
Four categories of grouping variables can be distinguished:

I.   Z is directly related to both X and Y·X
     ($\beta_{YZ \cdot X} \ne 0$, $\beta_{XZ} \ne 0$).

II.  Z is directly related to Y·X but not to X
     ($\beta_{YZ \cdot X} \ne 0$, $\beta_{XZ} = 0$).

III. Z is directly related to X but not to Y·X
     ($\beta_{YZ \cdot X} = 0$, $\beta_{XZ} \ne 0$).

IV.  Z is not related to either X or Y·X
     ($\beta_{YZ \cdot X} = \beta_{XZ} = 0$).

Figure 3.1 presents the path diagrams corresponding to the categories of
the taxonomy.

The categories of the taxonomy include all possible linear
relations linking prior grouping variables to the regression of Y on X.
Certain of these categories represent broader classes of variables. For
instance, any random grouping procedure will satisfy the conditions for
Category IV. Grouping on the regressor X is a special case of
Category III. Most systematic grouping variables belong to Category I.
Grouping on the dependent variable Y is a special case of Category I.
Any grouping variable can be uniquely categorized if the variances and
covariances of X, Y, and Z are known.
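
The four categories amount to a two-way classification on whether each
of the two coefficients linking Z to the system is zero. A minimal
Python sketch of that rule follows. The function name
`taxonomy_category` and the tolerance argument are hypothetical
conveniences, and the sketch takes the population coefficients as known,
which in practice they are not.

```python
def taxonomy_category(b_yz_x, b_xz, tol=1e-12):
    """Classify a grouping variable Z by its relations to Y.X and to X.

    b_yz_x: coefficient of Z in the equation for Y, with X held constant
    b_xz:   coefficient of Z in the equation for X
    tol:    values within tol of zero are treated as zero
    """
    related_to_y_given_x = abs(b_yz_x) > tol
    related_to_x = abs(b_xz) > tol
    if related_to_y_given_x and related_to_x:
        return "I"
    if related_to_y_given_x:
        return "II"
    if related_to_x:
        return "III"
    return "IV"


# Hypothetical examples.
print(taxonomy_category(0.3, 1.0))   # related to both
print(taxonomy_category(0.0, 1.0))   # related to X only, e.g. Z = X
print(taxonomy_category(0.0, 0.0))   # e.g. random grouping
```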

Under certain conditions discussed in Chapter 1, however, no
ungrouped estimate of $\sigma_{XY}$ is available. To see this, suppose
that data on X and Z are collected anonymously on occasion 1 and data on
Y and Z are collected on occasion 2. Then $\sigma_Z^2$, $\sigma_X^2$,
$\sigma_{XZ}$, and $\sigma_{YZ}$ can be estimated directly from the data.
But there is no natural way to pair X and Y scores, and $\sigma_{XY}$ and
thus $\beta_{YX}$ cannot be estimated directly. When this occurs, the
investigator can estimate $\beta_{XZ}$ and $\beta_{YZ}$, but not
$\beta_{YZ \cdot X}$. He can often guess whether $\beta_{YZ \cdot X}$ is
non-zero, and by doing so can judge whether grouping by Z will yield
unbiased and efficient estimates of $\beta_{YX}$. In Chapter 6, we shall
offer suggestions for grouping when $\sigma_{XY}$ is unknown (cf.
Section II.C of Chapter 6).

[Figure 3.1. Path Diagrams Corresponding to the Categories of the
Taxonomy. Panels: (a) Category I, (b) Category II, (c) Category III,
(d) Category IV. Diagrams not reproduced.]

V. Bias and Efficiency as a Function of Taxonomic Category

We examine how the relations specified for the taxonomic categories
can affect the bias and efficiency of the regression estimates from
grouped data. First, the general formulas for bias and efficiency from
Section III.C are developed for the modified structure. Then the
implications of this formulation are considered for each category
separately.

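A. Bias and Efficiency Formulas

In Section IV.B, we presented the following expression for the
discrepancy (bias) that results from grouping when the sample constitutes
the population:

[3.19]  $\theta = \bar{\beta}_{YX} - \beta_{YX} = \beta_{YZ \cdot X}\, \beta_{XZ} \left( \dfrac{\sigma^2_{\bar{Z}}}{\sigma^2_{\bar{X}}} - \dfrac{\sigma^2_{Z}}{\sigma^2_{X}} \right)$

When the data are a subsample from the population, the asymptotic
expression for the bias from grouping, found by comparing
plim($\bar{B}_{YX}$) (Equation [3.26]) with plim($b_{YX}$) (Equation
[3.22]), has the same form as [3.19]:

[3.28]  $\mathrm{plim}(d) = \mathrm{plim}(\bar{B}_{YX}) - \mathrm{plim}(b_{YX}) = \beta_{YZ \cdot X}\, \beta_{XZ} \left( \dfrac{\sigma^2_{\bar{Z}}}{\sigma^2_{\bar{X}}} - \dfrac{\sigma^2_{Z}}{\sigma^2_{X}} \right)$

Also, comparing [3.25] with [3.21], the expectation of the difference
between $\bar{B}_{YX}$ and $b_{YX}$ is given by

[3.29]  $E(d) = \beta_{YZ \cdot X}\, E\!\left[ \dfrac{\sum \bar{x}\bar{z}}{\sum \bar{x}^2} - \dfrac{\sum xz}{\sum x^2} \right]$

where lower-case letters denote deviations from the grand mean and the
sums over group means are weighted by group size. Since
$Z_{ij} = \bar{Z}_{i\cdot}$,

$\sum_{i=1}^{m} \sum_{j=1}^{n_i} z_{ij}^2 = \sum_{i=1}^{m} n_i \bar{z}_{i\cdot}^2$, i.e., $\sum z^2 = \sum \bar{z}^2$.

Similarly,

$\sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij} z_{ij} = \sum_{i=1}^{m} \bar{z}_{i\cdot} \sum_{j=1}^{n_i} x_{ij} = \sum_{i=1}^{m} n_i \bar{x}_{i\cdot} \bar{z}_{i\cdot}$, i.e., $\sum xz = \sum \bar{x}\bar{z}$.

So [3.29] can be written as

[3.29']  $E(d) = \beta_{YZ \cdot X}\, E\!\left[ \sum xz \left( \dfrac{1}{\sum \bar{x}^2} - \dfrac{1}{\sum x^2} \right) \right] = \beta_{YZ \cdot X}\, \beta_{XZ}\, E\!\left[ \sum z^2 \left( \dfrac{1}{\sum \bar{x}^2} - \dfrac{1}{\sum x^2} \right) \right] + \beta_{YZ \cdot X}\, E\!\left[ \sum zv \left( \dfrac{1}{\sum \bar{x}^2} - \dfrac{1}{\sum x^2} \right) \right]$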

With the exception of the last term, [3.29] now has the same form
and components as [3.19] and [3.28]. In each case, the bias term has
essentially the same straightforward interpretation if the between-group
and total variation of X are both non-zero. The grouping of observations
leads to biased estimation if all three of the following conditions
hold:

(a) The grouping variable Z has a direct relation to X
    ($\beta_{XZ} \neq 0$).

(b) The grouping variable Z has a direct relation to Y·X
    ($\beta_{YZ \cdot X} \neq 0$).

(c) The ratio of the between-group variation of Z to the
    between-group variation of X does not equal the ratio of
    the total variation of Z to the total variation of X.

Furthermore, since Z has been defined so that $Z_{ij} = \bar{Z}_{i\cdot}$
(and hence $\sigma^2_{\bar{Z}} = \sigma^2_Z$), we can rewrite [3.19] and
[3.28] as

[3.19']  $\theta = E(d) = \beta_{YZ \cdot X}\, \beta_{XZ}\, \sigma^2_Z \left( \dfrac{1}{\sigma^2_{\bar{X}}} - \dfrac{1}{\sigma^2_{X}} \right)$    (sample = population)

and

[3.28']  $\mathrm{plim}(d) = \beta_{YZ \cdot X}\, \beta_{XZ}\, \sigma^2_Z \left( \dfrac{1}{\sigma^2_{\bar{X}}} - \dfrac{1}{\sigma^2_{X}} \right)$    (sample ≠ population)

Thus, condition (c) can be restated as

(c') The between-group variation of X does not equal the total
     variation of X.

Other things being equal, the magnitude of the bias from grouping
increases directly as the relation of Z to X or Y·X increases or
as the variation of X is reduced by grouping. These three conditions
are not independent; in the next section, we explore some ramifications
of their interrelation.
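
The way the three conditions act together can be seen by evaluating the population bias expression [3.19'] directly. The sketch below is illustrative only; the function name and argument names are assumptions introduced here, and the numerical values are hypothetical.

```python
def grouping_bias(beta_yz_x, beta_xz, var_z, var_x_between, var_x_total):
    """Population bias of the grouped slope, following [3.19'].

    var_x_between is the variance of the group means of X and
    var_x_total is the variance of the ungrouped X.
    """
    if var_x_between <= 0 or var_x_total <= 0:
        raise ValueError("both variances of X must be positive")
    spread = 1.0 / var_x_between - 1.0 / var_x_total
    return beta_yz_x * beta_xz * var_z * spread


# Bias vanishes as soon as any one of conditions (a), (b), (c') fails.
print(grouping_bias(0.2, 0.0, 1.0, 0.658, 1.0))   # (a) fails
print(grouping_bias(0.0, 0.8, 1.0, 0.658, 1.0))   # (b) fails
print(grouping_bias(0.2, 0.8, 1.0, 1.0, 1.0))     # (c') fails
print(grouping_bias(0.2, 0.8, 1.0, 0.658, 1.0))   # all three hold
```

The error raised for a zero between-group variance mirrors the proviso in the text that the expression is interpretable only when the between-group and total variation of X are both non-zero.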

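The formula for the efficiency of $b_{YX}$ relative to $\bar{B}_{YX}$ as
an estimator of $\beta_{YX}$ can be found by substituting from [3.19'],
[3.23], and [3.27] into [3.12]:

[3.30]  $\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{V(b_{YX})}{\mathrm{MSE}(\bar{B}_{YX})} = \dfrac{\sigma^2_w\, E\!\left[ \dfrac{1}{SS_T(X)} \right]}{\sigma^2_w\, E\!\left[ \dfrac{1}{SS_B(X)} \right] + \beta^2_{YZ \cdot X}\, \beta^2_{XZ}\, \sigma^4_Z \left( \dfrac{1}{\sigma^2_{\bar{X}}} - \dfrac{1}{\sigma^2_{X}} \right)^2}$

For certain categories, this complicated expression will simplify
greatly, as Section V.C will show.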

B. Examination of Bias for Each Category

Equations [3.19'], [3.28'], and [3.29] can now be used to examine
each category of grouping variables for bias. The taxonomic categories
are considered in order.

1. Category I: Z directly related to both X and Y·X.

Category I includes all grouping variables which have direct
relations to both X and Y·X. An obvious example is that scholastic
aptitude (Z) may be related to achievement (Y) and to student academic
interests (X).

A more complicated example occurs when two distinct classifications
are made on the same achievement measure; for example, define Y
as the observed score on achievement and Z as the decile rank on
achievement. Thus Z will most likely be a Category I variable.
The broader classification for Z creates a measure whose correlation
with Y is other than 1.0 or 0 after X is partialled out. If
Y is linearly related to X, Z will also be related to X.

In general, the slope estimated from data grouped on a Category I
variable is a biased estimate of $\beta_{YX}$. The magnitude of this
bias is given exactly by [3.19'] for known values of
$\beta_{YZ \cdot X}$, $\beta_{XZ}$, $\sigma^2_Z$, $\sigma^2_{\bar{X}}$,
and $\sigma^2_X$, and can be approximated by [3.28'] and [3.29] when the
sample does not equal the population.

Thus, when Z is a Category I variable, bias is given by the
general equations:

[3.19']  $\theta = \beta_{YZ \cdot X}\, \beta_{XZ}\, \sigma^2_Z \left( \dfrac{1}{\sigma^2_{\bar{X}}} - \dfrac{1}{\sigma^2_{X}} \right)$    (sample = population)

[3.28']  $\theta = \mathrm{plim}(d) = \beta_{YZ \cdot X}\, \beta_{XZ}\, \sigma^2_Z \left( \dfrac{1}{\sigma^2_{\bar{X}}} - \dfrac{1}{\sigma^2_{X}} \right)$    (sample ≠ population, $N \to \infty$)

[3.29]  $\theta \approx E(d) = \beta_{YZ \cdot X}\, \beta_{XZ}\, E\!\left[ \sum z^2 \left( \dfrac{1}{\sum \bar{x}^2} - \dfrac{1}{\sum x^2} \right) \right]$    (sample ≠ population)

Section V.A has already discussed the conditions under which bias
occurs. In Chapter 6 we shall examine the bias of the slope estimates
from several grouping variables by substituting empirical estimates of
the model parameters into equation [3.19'].

At this point, however, we can get some idea about the bias for
Category I grouping by examining the bias in estimated coefficients
when the variables from the ungrouped model have been standardized
before grouping.* Assume that the $X_{ij}$, $Y_{ij}$, and $Z_{ij}$ are
standardized. Let m groups of equal size n be formed on discrete
values of Z so that $Z_{ij} = \bar{Z}_{i\cdot}$. Under these conditions,

(1)  $\sigma^2_X = \sigma^2_Z = 1$,

(2)  $\sigma^2_v = 1 - \beta^2_{XZ}$,

(3)  $\sigma^2_{\bar{v}} = \sigma^2_v / n$,

and

(4)  $\sigma^2_{\bar{X}} = \beta^2_{XZ} + \sigma^2_{\bar{v}} = \dfrac{(n-1)\beta^2_{XZ} + 1}{n}$,

where n is the number of observations per group (held constant over
groups).

* The practice of standardizing the variables before grouping serves
  two useful purposes. First, it places the regression coefficients
  on a uniform scale (0 to 1.0). Second, the coefficient from the
  regression of Y on X when both have unit variance equals the
  correlation between Y and X. This suggests that a potentially
  useful way to estimate zero-order correlation coefficients from
  grouped data is to regress Y on X when the ungrouped variables
  have been standardized.

After substituting (1) and (4) in [3.19'] we obtain

[3.31]  $\theta^* = E(d^*) = \beta_{YZ \cdot X}\, \beta_{XZ}\, \dfrac{(n-1)(1 - \beta^2_{XZ})}{(n-1)\beta^2_{XZ} + 1}$

where d* denotes the discrepancy from estimating the regression
coefficient for standardized observations from grouped data.

At this point we consider how the discrepancy varies according to
the relations of Z to X and Y and according to the number of
groups formed. To do this we assume that there is a pool of grouping
variables, Z's, which have been standardized and have varying
relations to X and Y (potentially different $\beta_{YZ \cdot X}$ and
$\beta_{XZ}$). For simplicity we let the number of groups formed by a
given Z vary according to the chosen grouping variable, but we assume
that equal size groups are formed.

In Figure 3.2, bias, $\theta^*$, is plotted against $\beta_{XZ}$ with
$\beta_{YZ \cdot X}$ fixed at .1 for selected values of n, where N = nm
is held constant. A comparable family of curves can be generated for any
value of $\beta_{YZ \cdot X}$. The curves are roughly symmetrical for
small n and become highly positively skewed for large n. This is as
expected since the groupings become coarser and less representative of
the ungrouped observations as n gets larger, for any set of fixed
relations between Z and X and Y·X.
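
The standardized bias in [3.31] is easy to evaluate numerically. The following sketch is illustrative; the function names `standardized_bias` and `bias_row` are introduced here and are not part of the original text. Values computed this way agree with the entries of Table 3.4 to within about .002, the small differences apparently arising from rounding in the tabled values.

```python
def standardized_bias(beta_yz_x, beta_xz, n):
    """Bias theta* of the grouped slope for standardized variables, per [3.31]."""
    numerator = (n - 1) * (1.0 - beta_xz ** 2)
    denominator = (n - 1) * beta_xz ** 2 + 1.0
    return beta_yz_x * beta_xz * numerator / denominator


def bias_row(n, betas_yz_x=(0.2, 0.5, 0.8), betas_xz=(0.2, 0.5, 0.8)):
    """One row of a bias table: beta_xz varies fastest within each beta_yz_x."""
    return [round(standardized_bias(b_yz, b_xz, n), 3)
            for b_yz in betas_yz_x
            for b_xz in betas_xz]


for group_size in (2, 4, 5, 11, 20, 50, 100, 500):
    print(group_size, bias_row(group_size))
```

Sweeping `beta_xz` from 0 to 1 for a fixed `beta_yz_x` and several values of n gives the kind of curve family described for Figure 3.2.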

Table 3.4 indicates the bias $\theta^*$ for several values of
standardized $\beta_{YZ \cdot X}$, standardized $\beta_{XZ}$, and n. An
examination of the tabled values leads to the following conclusions:

1) For any fixed values of $\beta_{YZ \cdot X}$ and $\beta_{XZ}$, bias
   increases with n (except when $\beta_{YZ \cdot X} = 0$ or
   $\beta_{XZ} = 0$ or 1).

2) For fixed $\beta_{XZ}$ (not 0 or 1) and n, bias increases with
   $\beta_{YZ \cdot X}$.

3) For fixed $\beta_{YZ \cdot X}$ (not 0 or 1) and n, bias first
   increases and then decreases as $\beta_{XZ}$ goes from 0 to 1.

Minimizing the direct relation of Z to Y·X and maximizing the direct
relation of Z to X is the safest way to keep the bias small.
$\theta^*$ approaches its maximum rapidly even for small values of n.
Large n is less damaging when $\beta_{XZ}$ is large and
$\beta_{YZ \cdot X}$ is small, though the necessary value of
$\beta_{XZ}$ increases rapidly with $\beta_{YZ \cdot X}$. For n = 500
and $\beta_{YZ \cdot X} = .1$, $\beta_{XZ}$ must be greater than .60 to
have bias less than .1. For n = 500 and $\beta_{YZ \cdot X} = .2$,
$\beta_{XZ}$ must be greater than .78 to achieve the same result.
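
The thresholds quoted above can be checked with a simple grid search over $\beta_{XZ}$. This is a rough sketch: the function names and the step size are assumptions made for illustration, and the search relies on the bias falling steadily as $\beta_{XZ}$ approaches 1. For n = 500 it returns values close to the .60 and .78 quoted in the text.

```python
def standardized_bias(beta_yz_x, beta_xz, n):
    """Bias theta* of the grouped slope for standardized variables, per [3.31]."""
    numerator = (n - 1) * (1.0 - beta_xz ** 2)
    denominator = (n - 1) * beta_xz ** 2 + 1.0
    return beta_yz_x * beta_xz * numerator / denominator


def min_beta_xz_for_bias(beta_yz_x, n, max_bias, step=0.001):
    """Smallest beta_xz, scanning down from 1, whose bias stays <= max_bias.

    The scan stops at the first grid point where the bias exceeds max_bias,
    so it returns the lower end of the acceptable range next to beta_xz = 1.
    If the bias never exceeds max_bias, the smallest grid point is returned.
    """
    steps = int(round(1.0 / step))
    threshold = 1.0
    for i in range(steps, 0, -1):
        beta_xz = i * step
        if standardized_bias(beta_yz_x, beta_xz, n) > max_bias:
            break
        threshold = beta_xz
    return threshold


print(min_beta_xz_for_bias(0.1, 500, 0.1))
print(min_beta_xz_for_bias(0.2, 500, 0.1))
```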

The bias from Category I grouping can exceed 1 with large n
and $\beta_{YZ \cdot X} > \beta_{XZ}$. This should be a further warning
against choosing a grouping variable strongly related to Y·X and against
concentrating observations in a few large groups. On the other hand, the
relatively small bias expected with small $\beta_{YZ \cdot X}$ offers
some hope for reasonable estimates from data grouped by a Category I
variable.

Table 3.4. Bias $\theta^*$ in estimating the standardized regression
           coefficient $\beta_{YX}$ from grouped data as a function of
           group size, standardized $\beta_{YZ \cdot X}$, and
           standardized $\beta_{XZ}$.

                         Magnitude of the Bias (a)

 Group     beta_YZ.X = .2        beta_YZ.X = .5        beta_YZ.X = .8
 Size      beta_XZ =             beta_XZ =             beta_XZ =
   n        .2     .5     .8      .2     .5     .8      .2     .5     .8

    2      .037   .060   .035    .093   .150   .088    .148   .240   .140
    4      .103   .129   .059    .258   .323   .148    .412   .516   .236
    5      .132   .150   .065    .330   .375   .163    .528   .600   .260
   11      .274   .214   .078    .685   .535   .195   1.096   .856   .312
   20      .415   .248   .083   1.038   .620   .208   1.660   .992   .332
   50      .636   .277   .087   1.590   .693   .218   2.544  1.108   .348
  100      .766   .288   .089   1.915   .720   .223   3.064  1.152   .356
  500      .914   .298   .090   2.285   .745   .225   3.656  1.192   .360

(a) $\theta^* = \beta_{YZ \cdot X}\, \beta_{XZ}\, \dfrac{(n-1)(1 - \beta^2_{XZ})}{(n-1)\beta^2_{XZ} + 1}$

2. Category II: Z directly related to Y·X but not to X.

Category II contains grouping variables Z which are related to
Y ($\beta_{YZ \cdot X} \neq 0$) but are not related to X
($\beta_{XZ} = 0$). Since $\beta_{XZ} = 0$, $E(\sum xz) = 0$, and
regardless of whether the sample equals the population, the bias
$\theta = E(d) = 0$, as long as $\sum \bar{x}^2 \neq 0$.

Thus estimates derived from data grouped by a Category II variable
are unbiased unless there is no between-group variation in X. This
conclusion is not surprising. When Z is a Category II variable, we
are considering the standard model of equation [3.1] where the "other"
determiners represented by u have been divided into two parts (Z and
w), both independent of X. Unbiased estimates are expected under
these conditions.

It is possible to have no between-group variation in X for a
Category II variable. This occurs when the grouping variable lies in
the X,Y plane. Then, since $\sigma^2_{\bar{X}} = 0$, the bias from
grouping is indeterminate, as can be seen by substitution into [3.19']:

$\theta = \beta_{YZ \cdot X}\, \beta_{XZ}\, \sigma^2_Z \left( \dfrac{1}{\sigma^2_{\bar{X}}} - \dfrac{1}{\sigma^2_{X}} \right) = \beta_{YZ \cdot X}\, (0)\, \sigma^2_Z \left( \dfrac{1}{0} - \dfrac{1}{\sigma^2_{X}} \right)$

There is no simple way to consider further the magnitude of the bias.
There is some evidence based on simulation studies that bias estimates
fluctuate wildly in this special case.

Category II variables are hard to find. None of the more than 200
pairs of parameter estimates ($\hat{\beta}_{YZ \cdot X}$ and
$\hat{\beta}_{XZ}$) from the empirical data discussed in Chapter 6
satisfactorily meet the conditions for Category II grouping. Such
variables could be constructed by orthogonalization, but other
categories of variables yield unbiased estimators with greater
efficiency. Henceforth, Category II will receive little attention.

3. Category III: Z directly related to X but not to Y·X.

Category III includes variables which are related to Y only
through X. Systematic grouping on the independent variable falls in
this category. A Category III variable may be an explicit ordered
function of X such as the decile rank of X, and if so, the within-group
distributions of X do not overlap. It is also possible that a
Z from Category III involves some random component (v) which allows
the within-group distributions of X to overlap. The presence or
absence of overlap is irrelevant in the determination of bias, but it
can affect efficiency.

Since $\beta_{YZ \cdot X} = 0$ in Category III, equations [3.14a] and
[3.17a] reduce to

$Y = \alpha + \beta_{YX} X + w$

and

$\bar{Y} = \alpha + \beta_{YX} \bar{X} + \bar{w}$

These equations are the same as [3.1] though the disturbance terms have
been relabeled. Thus for Category III grouping, the standard model and
our modified structure with the grouping variable incorporated are the
same, and estimate the same $\beta_{YX}$.

From equations [3.19'], [3.28'], and [3.29'], it follows that when
Z is a Category III variable,

$\theta = 0$,

$\mathrm{plim}(\bar{B}_{YX}) = \mathrm{plim}(b_{YX}) = \beta_{YX}$,

and

$\theta = E(d) = 0$.

Thus the least-squares estimators of $\beta_{YX}$ from data grouped on a
variable Z which is related to X but not to Y·X are unbiased for
any value of $\beta_{XZ}$.

The bias and efficiency resulting from grouping by a function of
X (Category III grouping) have been studied extensively, the most
prominent study being that of Prais and Aitchinson (1954). (Most
variables systematically related to X do not strictly satisfy the
condition $\beta_{YZ \cdot X} = 0$ and thus exhibit some minimal bias.)
Our conclusions confirm those of earlier writers that Category III
variables yield the best estimates under a very general set of analysis
situations. The estimates are always unbiased and can be highly
efficient (see Section V.C.2). If such variables do exist in a study,
the remaining decision should focus on the choice among Category III
variables, and, once a variable is chosen, on the definition of the
classes. These problems are considered in Chapter 4 under the heading of
within-category factors.

4. Category IV: Z not linearly related to X or Y·X
   ($\beta_{YZ \cdot X} = 0$, $\beta_{XZ} = 0$).

Category IV contains all variables which have no linear relation
to either X or Y. A Category IV variable can be generated by
assigning numbers randomly to individuals, such as a student ID.
Category IV grouping, alternatively called random grouping, generates
random groups of (X,Y) observations.

When $\beta_{YZ \cdot X} = 0$ and $\beta_{XZ} = 0$, it follows that

$\theta = 0$

and

$\mathrm{plim}(d) = 0$.

Hence,

$\theta = E(d) = 0$

for any Category IV variable; $\bar{B}_{YX}$ is an unbiased estimator of
$\beta_{YX}$.

The interpretation of this result is straightforward. Estimating
$\beta_{YX}$ from the means of m randomly formed groups is statistically
equivalent to estimating $\beta_{YX}$ from a sample of size m drawn
randomly from the N observations or from the m stratum means where the
strata have been randomly formed (Hansen, Hurwitz, and Madow, 1953). In
either case, the random process does not alter any pre-existing
relations among the variables. All variances and covariances among
variables decrease in proportion to the number of observations in a
group for fixed group size n for Category IV grouping. This
proportionate reduction in magnitude does not alter the estimate of the
regression coefficient.

Category IV variables are not the best choice for grouping when
efficient estimates are desired because of the difficulty of obtaining
an adequate number of groups to overcome the marked efficiency reduction
(see Section V.C.1). In certain instances, however, Category IV
variables may be the only recourse for the investigator who has limited
information about other ways of forming groups.

C. Efficiency Considerations

Equation [3.12] defines efficiency. Below we evaluate the efficiency
for each category of grouping variables.

1. Category IV

For Category IV variables, since $\beta_{YZ \cdot X} = 0$ and
$\beta_{XZ} = 0$, equation [3.11'] becomes

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{E\!\left[ \dfrac{1}{SS_T(X)} \right]}{E\!\left[ \dfrac{1}{SS_B(X)} \right]}$

Several investigators have already provided simplified expressions
for the efficiency of random grouping under the assumption that the X
are fixed and given. An especially cogent derivation by Feige and
Watts (1972) is presented below, using our terminology and notation.

Feige and Watts' derivation is based on the theory of sampling from
a finite population. The set of N observations is regarded as a
population. If the observations are assigned randomly to m groups of
$n_i$ in each group, many groupings are possible. The expected
within-group sum of squares for the ith group is
$SS_T(X)\,[(n_i - 1)/(N - 1)]$. Therefore, for Category IV grouping, the
expectation of the total sum of squared deviations from the group means
(the within-group sum of squares) is

$E[SS_W(X)] = E\!\left[ \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2 \right] = \sum_{i=1}^{m} SS_T(X)\, \dfrac{n_i - 1}{N - 1} = \dfrac{N - m}{N - 1}\, SS_T(X)$

From the formula for the decomposition of the total sum of squares,
the expectation of the between-group sum of squares for Category IV
grouping can be written as

$E[SS_B(X)] = E[SS_T(X)] - E[SS_W(X)]$

$\qquad = E[SS_T(X)] - \dfrac{N - m}{N - 1}\, E[SS_T(X)]$    (substituting from above)

$\qquad = \dfrac{m - 1}{N - 1}\, E[SS_T(X)]$

If the $X_{ij}$ are fixed and given,

$E[SS_T(X)] = SS_T(X)$

and

$E[SS_B(X)] = SS_B(X)$

Also, in Section III.D, we showed that if, in addition, $\bar{B}_{YX}$
is an unbiased estimator of $\beta_{YX}$, the efficiency of grouping is
given by

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{SS_B(X)}{SS_T(X)}$

Hence, by substitution for $SS_B(X)$, the efficiency of Category IV
grouping when the $X_{ij}$ are fixed and given is equal to

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{m - 1}{N - 1} \cdot \dfrac{SS_T(X)}{SS_T(X)} = \dfrac{m - 1}{N - 1}$

At best, when there are m = (N/2) groups of two observations each, the
efficiency of Category IV grouping is only about .5, under the
assumption that the $X_{ij}$ are fixed and given. However, the
efficiency of random grouping provides a standard to which we can compare
the efficiency of grouping in a systematic manner. Only those estimates
with efficiency greater than (m-1)/(N-1) offer an improvement over
random grouping.
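
The finite-population result for the expected between-group sum of squares can be verified by brute force on a small data set. The sketch below is an illustration, not part of the Feige and Watts derivation: the function names and the six data values are hypothetical, groups are assumed to be of equal size, and exact rational arithmetic is used so the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import permutations


def total_ss(x):
    """Total sum of squared deviations of x about its mean."""
    mean = Fraction(sum(x), len(x))
    return sum((Fraction(v) - mean) ** 2 for v in x)


def expected_between_ss(x, m):
    """Average between-group SS over all random assignments to m equal groups.

    Every ordering of x is cut into m consecutive blocks of n = N / m values,
    which gives each possible grouping equal weight.
    """
    size = len(x) // m
    grand_mean = Fraction(sum(x), len(x))
    total = Fraction(0)
    count = 0
    for ordering in permutations(x):
        ss_between = Fraction(0)
        for g in range(m):
            group = ordering[g * size:(g + 1) * size]
            group_mean = Fraction(sum(group), size)
            ss_between += size * (group_mean - grand_mean) ** 2
        total += ss_between
        count += 1
    return total / count


def random_grouping_efficiency(m, n_obs):
    """Efficiency (m - 1) / (N - 1) of random grouping with fixed X."""
    return Fraction(m - 1, n_obs - 1)


x = [1, 2, 4, 7, 11, 16]          # six hypothetical integer scores
print(expected_between_ss(x, 3), random_grouping_efficiency(3, 6) * total_ss(x))
```

Enumeration is feasible only for very small N (here 720 orderings); for realistic sample sizes the closed form (m-1)/(N-1) is the practical route.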

2. Category III

Category III grouping can produce small values of
$E\!\left[ \dfrac{1}{SS_B(X)} \right]$ because such groupings presumably
assign observations to groups in part on the basis of their X values.
Since maximization of the between-group sum of squares is a criterion
for minimizing information loss through grouping, we expect Category III
grouping to yield relatively efficient estimates.

Prais and Aitchinson (1954) and Cramer (1964) have examined the
efficiency of grouping on the independent variable under the assumption
that the $X_{ij}$ are fixed and given. While they discussed grouping in a
seemingly general way, their methods and conclusions are applicable to
our Category III variables. Prais and Aitchinson presented a
particularly illuminating example. They let X take on the mn
equally-spaced values, $X_{ij} = 1, \ldots, mn$. Then adjacent
observations were grouped into m groups of equal size and each value of
$X_{ij}$ was replaced by its group mean $\bar{X}_{i\cdot}$. Therefore,
$\bar{X}_{i\cdot}$ takes on the values $[(2i-1)n + 1]/2$ where
$i = 1, \ldots, m$. Then

$SS_T(X) = \dfrac{mn(m^2 n^2 - 1)}{12}$

for the ungrouped observations and

$SS_B(X) = \dfrac{mn^3(m^2 - 1)}{12}$

for the grouped values. Hence,

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{n^2(m^2 - 1)}{m^2 n^2 - 1} = 1 - \dfrac{n^2 - 1}{m^2 n^2 - 1}$

If $n \ge m$, then

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) > 1 - \dfrac{1}{m^2}$

in this special case of Category III grouping with fixed $X_{ij}$.
[For n < m, $\mathrm{Eff}(b_{YX}, \bar{B}_{YX})$ is also greater than
$(1 - \frac{1}{m^2})$.]

Thus the lower bound of the efficiency of grouping related to X under
these conditions depends only on m, the number of groups formed.
Unfortunately, the distribution of observations seldom approaches this
special case. The conditions under which the efficiency of other
Category III variables approach this case are discussed in Chapter 4.
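
The equally spaced example can be reproduced directly. The sketch below (function names are illustrative assumptions) computes the ratio of between-group to total sum of squares for X = 1, ..., mn grouped into m adjacent blocks of n, and compares it with the closed form $n^2(m^2-1)/(m^2 n^2 - 1)$ that follows from the two sums of squares.

```python
from fractions import Fraction


def adjacent_grouping_efficiency(m, n):
    """SS_B(X) / SS_T(X) for X = 1..mn grouped into m adjacent blocks of n."""
    x = list(range(1, m * n + 1))
    grand_mean = Fraction(sum(x), len(x))
    ss_total = sum((xi - grand_mean) ** 2 for xi in x)
    ss_between = Fraction(0)
    for i in range(m):
        group = x[i * n:(i + 1) * n]
        group_mean = Fraction(sum(group), n)
        ss_between += n * (group_mean - grand_mean) ** 2
    return ss_between / ss_total


def closed_form_efficiency(m, n):
    """Closed form n^2 (m^2 - 1) / (m^2 n^2 - 1) for the same example."""
    return Fraction(n * n * (m * m - 1), m * m * n * n - 1)


for m, n in [(2, 10), (5, 20), (10, 10)]:
    eff = adjacent_grouping_efficiency(m, n)
    print(m, n, float(eff), float(1 - Fraction(1, m * m)))
```

Even five groups already retain more than 96 percent of the information in this idealized case, which is why the bound depends on m alone.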

3. Category II

In Category II grouping, Z and X are stochastically independent.
Category II variables share this property ($\beta_{XZ} = 0$) with
Category IV variables. Since the efficiency of grouping is a function
only of the variation of $\bar{X}$ and X when the estimators are
unbiased, the efficiency of Category II grouping is the same as for
Category IV grouping. That is, when $SS_B(X) \neq 0$, we expect
Category II grouping also to have efficiency on the order of
(m-1)/(N-1), the ratio of the number of groups to the number of
observations. It appears that neither Category II nor Category IV
grouping yields estimators that approach the efficiency of the
estimators from Category III.

4. Category I

When Z is a Category I variable, both bias and variance of
$\bar{B}_{YX}$ affect the efficiency of estimation. This complicates the
evaluation of the efficiency of grouping for this category of variables.
In its simplest form, the efficiency of Category I grouping is given by

[3.12]  $\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{V(b_{YX})}{\mathrm{MSE}(\bar{B}_{YX})}$

If we again assume that the $X_{ij}$ are fixed and given,

$V(b_{YX}) = \dfrac{\sigma^2_w}{SS_T(X)}$

and

$\mathrm{MSE}(\bar{B}_{YX}) = \dfrac{\sigma^2_w}{SS_B(X)} + \theta^2$

and thus [3.12] can be written as

[3.32]  $\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{\sigma^2_w / SS_T(X)}{\sigma^2_w / SS_B(X) + \theta^2} = \dfrac{\sigma^2_w}{\sigma^2_w + \theta^2\, SS_B(X)} \cdot \dfrac{SS_B(X)}{SS_T(X)} \le \dfrac{SS_B(X)}{SS_T(X)}$

That is, the correlation ratio is an upper bound for the efficiency of
Category I grouping when the $X_{ij}$ are fixed and given.

One implication of the above is that grouping by a Category I
variable is never more efficient than grouping by a Category III
variable with comparable $SS_B(X)$. But since grouping randomly provides
a lower bound for the efficiency of grouping when $\bar{B}_{YX}$ is an
unbiased estimator of $\beta_{YX}$, Category I grouping can be more
efficient than random grouping when $\theta$ is small.

For example, assume that 50 equal-size groups of 20 are formed.
Let $\beta_{YX \cdot Z} = .5$, $\beta_{YZ \cdot X} = .2$, and
$\beta_{XZ} = .8$. Also, assume that
$\sigma^2_X = \sigma^2_Y = \sigma^2_Z = 1$. Then, after solving for w in
[3.14a] and remembering that w is unrelated to X and Z, we have

$\sigma^2_w = \sigma^2_Y - \beta^2_{YX \cdot Z}\, \sigma^2_X - \beta^2_{YZ \cdot X}\, \sigma^2_Z - 2 \beta_{YX \cdot Z}\, \beta_{YZ \cdot X}\, \sigma_{XZ}$

$\qquad = 1 - (.5)^2 - (.2)^2 - 2(.5)(.2)(.8)$

$\qquad = .55$

Also, from formula (4) of Section V.B.1,

$\sigma^2_{\bar{X}} = [(19)(.8)^2 + 1]/20 = .658$

Hence

$\eta^2_X = .658$

and

$SS_B(X) = (999)(.658) = 657.34$

From Table 3.4 we get the predicted bias for our chosen values of
$\beta_{YZ \cdot X}$ (.2), $\beta_{XZ}$ (.8), and n (20):
$\theta = .083$ ($\theta^2 = .007$).

Substituting the above in [3.32], we get

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{.55}{.55 + (.007)(657.34)} \cdot \dfrac{657.34}{999}$

$\qquad = (.107)(.658)$

$\qquad = .070$

In comparison, the estimated efficiency from forming 50 groups of size
20 randomly is

$\mathrm{Eff}(b_{YX}, \bar{B}_{YX}) = \dfrac{49}{999} = .049$

Thus it is possible to improve efficiency relative to random grouping
by grouping on a variable which yields small bias but is strongly
related to X. By similar reasoning, we conclude that in certain cases,
Category I grouping can yield more efficient estimators than Category II
grouping also.

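The arithmetic of the worked example above (50 groups of 20, standardized variables) can be retraced with a short script. This is a sketch under the example's stated assumptions; the function and variable names are introduced here for illustration, and unrounded intermediate values are used, so the Category I efficiency comes out near .071 rather than the rounded .070.

```python
def category_one_efficiency(var_w, bias, ss_between, ss_total):
    """Efficiency V(b) / MSE(B-bar) with fixed X, following [3.32]."""
    variance_ungrouped = var_w / ss_total
    mse_grouped = var_w / ss_between + bias ** 2
    return variance_ungrouped / mse_grouped


m, n = 50, 20                      # number of groups, observations per group
n_obs = m * n
beta_yx_z, beta_yz_x, beta_xz = 0.5, 0.2, 0.8

# Residual variance of w for standardized X, Y, Z (so cov(X, Z) = beta_xz).
var_w = 1 - beta_yx_z ** 2 - beta_yz_x ** 2 - 2 * beta_yx_z * beta_yz_x * beta_xz

# Formula (4): variance of the group means of X, equal here to the correlation ratio.
var_x_between = ((n - 1) * beta_xz ** 2 + 1) / n

ss_total = float(n_obs - 1)        # (N - 1) times the unit variance of X
ss_between = ss_total * var_x_between

# Standardized bias from [3.31].
bias = beta_yz_x * beta_xz * (n - 1) * (1 - beta_xz ** 2) / ((n - 1) * beta_xz ** 2 + 1)

eff_category_one = category_one_efficiency(var_w, bias, ss_between, ss_total)
eff_random = (m - 1) / (n_obs - 1)

print(round(var_w, 3), round(var_x_between, 3), round(bias, 3))
print(round(eff_category_one, 3), round(eff_random, 3))
```

Changing `beta_yz_x` in this script shows how quickly the advantage over random grouping disappears as the bias term grows.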
VI. The Taxonomy as a Guide for Investigation

The main implication from the above discussion is that the
investigator should consider the relations of the alternative grouping
variables to the study variables before collecting his data, using such
prior knowledge as is available. This will enable him to collect
information on only those grouping variables that yield estimates
having the desired properties.

If the investigator demands an unbiased estimate of $\beta_{YX}$, then,
under the assumptions of the models, variables from Categories II, III,
and IV can be satisfactory. While Category IV variables can always be
created, they are relatively inefficient. Category III variables can
be highly efficient, yielding large values of $SS_B(X)$. The efficiency
of Category II grouping is no better than that of Category IV grouping
because observations are assigned to groups essentially randomly with
respect to X. Category III variables are clearly the best choice for
data aggregation.

Category I variables yield biased estimates though the bias can be
small with large $\beta_{XZ}$ and small $\beta_{YZ \cdot X}$. Category I
estimators are less efficient than Category III estimators but can be
more efficient than those from Category II or Category IV grouping. If
small bias is tolerable and Category III variables are hard to find,
Category I grouping may be advisable.

Most of the discussion has assumed that an investigator has the
original observations and can choose his own grouping procedure. Data
can be available in aggregated form only, however, e.g., when individual
data have been aggregated for economy of storage or for confidentiality.
The grouping variables that generally appear under these
circumstances are geographic variables such as "state" and "census
tract", and system delimiters such as "school" and "classroom". These
grouping variables are generally related to X and Y·X and hence
are Category I variables. Regression estimates determined under these
conditions should be interpreted cautiously.

CHAPTER 4

ADDITIONAL CONSIDERATIONS IN THE SINGLE-REGRESSOR CASE

Until now, the discussion has concentrated on the effects of the
linear relations of the grouping characteristic to the main variables
on the precision of estimation from grouped observations. Other
properties of the grouping characteristic (the number and size of the
groups it generates, its distribution, its scale of measurement) need
to be examined. Here we describe how these within-variable properties
or factors affect the "utility" of a possible grouping variable.

Under the heading of properties of the distribution of observations,
we consider the coarseness of grouping, the distribution of
observations among the groups, and the distribution of the values of
the independent variables both within and among the groups. These
factors can often be manipulated by the investigator to improve
estimation procedures.

Then, under the heading of scale of measurement, we discuss several
methods for handling nominal(1) characteristics, such as school or
census tract. Such characteristics are of vital concern in recent
educational investigations (see Averch et al., 1972). We consider in
detail two related approaches to the problem. One approach [suggested by
Wiley (personal communication)] provides a general scheme for
classifying grouping variables on the basis of the scale (interval or
nominal) and the type of variable (fixed or random). The other approach
employs dummy coding to generate dichotomous variables to represent the
grouping characteristic. The investigator then examines how properties
of the dummy variables affect the proportion of variation accounted for
in the model. This discussion relies less on formal mathematics than the
preceding chapter. However, our exposition is tied conceptually to
historical developments in the mathematics of scales of measurement and
distribution. For our part, we are attempting to elaborate how the
properties create distortions in empirical investigations of aggregated
data.

(1) The discussion also applies to ordinal characteristics which are not
    transformed and treated as interval.

I. Distributional Factors

In Chapter 3 we indicated that alternative grouping variables can
be generated from a single grouping characteristic. Each grouping
variable provides a unique classification of the individual
observations. Thus, if groups are formed on achievement quartiles, the
"grouping variable" is four-valued. There is one value for each
quartile; the finer subdivision by percentiles or by score points is
ignored. How to subdivide the scale is often under the investigator's
control. This is particularly true of characteristics that have
quasi-continuous distributions, e.g., "age" and "test score". There may
also be a choice in subdividing a nominal grouping variable. Thus, race
can be subdivided into "Anglo" and "Non-Anglo" or into "Anglo",
"Asian-American", "American Indian", and so on.

In this section we examine the within-variable factors that are
affected by the manipulation of the class boundaries of a given
grouping characteristic, using as an example the variables parental
income (X) and family expenditures on higher education (Y).

Suppose that educational background is taken as the basis for
grouping. The investigator can choose the number of groups (classes)
"m" for educational background and the location of the class boundaries.
Table 4.1 illustrates several possibilities for subdividing educational
background. $Z_{(5)}$ is a five-group classification and $Z_{(10)}$ and
$Z'_{(10)}$ are ten-group breakdowns. With fixed m, the number of cases
per group and the skewness of the distribution depend on the boundaries.

Since the classifications of educational background in Table 4.1
give different $SS_B(X)$, the efficiencies of the grouped estimators
they generate also differ. We explore these factors systematically
below.

A. Coarseness of Grouping

In Chapter 3 we found that the coarseness of grouping, by which we
mean the number of groups formed (m) for a fixed number of observations
(N), has important effects on both bias and efficiency of
grouping. According to equation [3.31], bias is inversely related to
m. In addition, the efficiency of grouping increases with the number
of classes. This finding has been supported through analyses of
empirical and hypothetical data by several investigators (Blalock, 1964;
Cramer, 1964; Prais and Aitchinson, 1954).

The effect of m on efficiency has already been discussed in
connection with random grouping. The present discussion extends the
"coarseness" principle to the more general case where the grouping
variable is nonrandom. In our example, either $Z_{(10)}$ or $Z'_{(10)}$
yields a more efficient estimate than $Z_{(5)}$. With non-zero
$\beta_{XZ}$, the groups of $Z_{(10)}$ tend to be more homogeneous than
those of $Z_{(5)}$. In other words, the within-group variation of income
and educational background is smaller with the ten-group classification
of educational background than with the five-group classification. This
means that the corresponding between-group variation is larger with the
total variation held constant. So the correlation ratio of either
$Z_{(10)}$ or $Z'_{(10)}$ is greater than that of $Z_{(5)}$ and the
estimate more efficient.

Table 4.1. Alternative grouping variables based on the same grouping
           characteristic.

                              Grouping Variables

              Z_(5)                 Z_(10)          Z'_(10)

              0-6 Years             None            0-6 Years
              7-10 Years            1-2 Years       7-10 Years
 Classes      11-HS Diploma         3-4 Years       11-HS Diploma
 Describing   1-3 Yrs. Beyond HS    5-6 Years       1-2 Yrs. Beyond HS
 Father's     More than 3 Yrs.      7-8 Years       3-4 Yrs. Beyond HS
 Education      Beyond HS
                                    9-10 Years      Bachelor's Degree
                                    11-12 Years     Work Beyond Bachelor's
                                    13-14 Years     Master's Degree
                                    15-16 Years     Work Beyond Master's
                                    More than       Degree Beyond Master's
                                      16 Years        (PhD, MD, LLD, etc.)

Cramer's paper (1964, p. 241) provides a particularly illuminating
analysis of this topic. He considers the case where the individual
observations are ordered according to their X values and the sample
range is divided into m equal intervals. The total sum of squares is
partitioned into between-groups and within-groups sums of squares, and
the components are divided by the total. After rearranging terms,
Cramer arrives at the efficiency equation

[4.1]    $$\frac{SS_B(X)}{SS_T(X)} = 1 - \frac{SS_W(X)}{SS_T(X)}$$

where $SS_W(X)$ is the pooled within-group sum of squares of the $X_i$.
Cramer then estimates $SS_T(X)$ and $SS_W(X)$. For the sample of N
original observations,

         $$SS_T(X) = N\sigma_X^2$$

where $\sigma_X^2$ is the population variance.

For his grouping method, the width of all class intervals is
uniform and equals

         $$\sigma_X \, \frac{\mathrm{range}(X)}{m}$$

where the sample range of X is expressed in terms of the population
standard deviation. Cramer then states that if the sample $X_i$ are
uniformly distributed within each class, the within-group variance
of each class is

         $$WV(X_i) = \frac{\sigma_X^2}{12} \left[ \frac{\mathrm{range}(X)}{m} \right]^2 .$$

So the pooled within-class variation is approximated by

[4.2]    $$SS_W(X) \approx \frac{N\sigma_X^2}{12} \left[ \frac{\mathrm{range}(X)}{m} \right]^2 .$$

By substituting [4.2] into [4.1], we obtain the approximation

[4.3]    $$\frac{SS_B(X)}{SS_T(X)} \approx 1 - \frac{[\mathrm{range}(X)]^2}{12m^2} .$$

Cramer points out that his approximation is justified for large N
and relatively small m because it depends on the replacement of random
variables by their expected values. He also emphasizes that his
estimate of the within-groups variation of X is an overestimate when the
actual distribution of X within a class is a strip from the normal
distribution, and not a rectangle.

One can use values from the sampling distribution of range(X)
to provide efficiency estimates for various combinations of m and N.
From Cramer, the expected values of range(X) with the sample sizes
100, 200, 500, and 1,000 are 5.015, 5.492, 6.073, and 6.483,
respectively. Table 4.2 includes the efficiency of grouping N observations
into m equal-interval groups. The values are in agreement with a
similar table by Cramer (1964, p. 244).
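
As a check on the arithmetic, approximation [4.3] can be evaluated
directly. The following Python sketch is illustrative only: the function
name `cramer_efficiency` and the dictionary `EXPECTED_RANGE` are our own
labels, not Cramer's, and the expected ranges are simply the four values
quoted above. Under these assumptions the results agree with the entries
of Table 4.2 to within about 0.001.

```python
def cramer_efficiency(expected_range, m):
    """Approximate SS_B(X)/SS_T(X) for m equal-interval groups, as in [4.3].

    expected_range is E[range(X)] expressed in population standard
    deviation units; m is the number of equal-width class intervals.
    """
    return 1.0 - expected_range ** 2 / (12.0 * m ** 2)


# Expected sample ranges quoted in the text (attributed there to Cramer, 1964).
EXPECTED_RANGE = {100: 5.015, 200: 5.492, 500: 6.073, 1000: 6.483}

# One row of efficiencies per sample size, for the group counts in Table 4.2.
GROUP_COUNTS = (2, 4, 5, 10, 20, 25)
table = {
    n: [round(cramer_efficiency(r, m), 3) for m in GROUP_COUNTS]
    for n, r in EXPECTED_RANGE.items()
}

for n, row in table.items():
    print(n, EXPECTED_RANGE[n], row)
```

Because the approximation depends only on the ratio of the expected range
to m, efficiency falls as N grows with m held fixed, which is the pattern
visible down each column of the table.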

Efficiency appears to be very high except with very small m.
Most investigators would happily use group means to reduce the cost of
data processing when the efficiency of the regression estimate from
grouped data is this high.

Cramer describes a fairly representative method of grouping in
economic studies. Unfortunately, his findings do not apply to
Category III grouping variables with unequal intervals, nor do they
apply to variables in other categories. Equal-interval grouping may not
be appropriate in many educational investigations. Thus we cannot
expect estimates as efficient as those depicted in Table 4.2.
Table 4.2. Efficiency of alternative ways of grouping on the same
characteristic as a function of sample size and number of groups.

                              Efficiency ($SS_B(X)/SS_T(X)$)
Sample                               Number of Groups
Size N   E[range(X)]*    m=2     m=4     m=5     m=10    m=20    m=25
------   ------------   -----   -----   -----   -----   -----   -----
   100      5.015       0.476   0.869   0.916   0.979   0.995   0.997
   200      5.492       0.372   0.843   0.899   0.974   0.994   0.996
   500      6.073       0.232   0.808   0.877   0.969   0.992   0.995
  1000      6.483       0.123   0.781   0.860   0.965   0.991   0.994

*See page 93.
B. Distribution of Observations Among the Groups

The distribution of observations among the groups is of concern
only when there are some groups with very few observations and when the
independent variable is imperfectly measured. In the former case, some
group means are unstable, and their instability reduces the precision
of the grouped estimate.

A large number of observations per group is needed to cancel out
the effects of random errors of measurement on the independent variable
(Blalock, Carter, and Wells, 1971). In the example above, this can
mean that $Z_{(5)}$ is better for grouping than $Z_{(10)}$, depending on
the within-group distribution of the income values and on the size of the
errors.

It is not always easy to determine whether there are enough
observations per group for adequate stability. Generally, grouping variables
with large skewness coefficients yield imprecise estimates. However,
with other variables, groups with few observations are scattered along
the Z scale. With these variables, the investigator must rely on his
understanding of the nature of the grouping characteristic and its
relation to other study variables to avoid imprecise estimates.

C. Distribution of the Independent Variable Within and Among 
Groups 

Though m = 10 for both $Z_{(10)}$ and $Z'_{(10)}$ in Table 4.1, the
two classifications yield equally efficient estimators only when
$V(X \mid Z_{(10)}) = V(X \mid Z'_{(10)})$. The subdivisions of these two
classifications are not likely to result in equal between-group
variances, and the pooled within-group variation in X for $Z_{(10)}$ and
$Z'_{(10)}$ is undoubtedly different. Thus the within-group distributions
of X and the overlap of these distributions are affected by the placement
of the class boundaries and, in turn, affect the efficiencies of grouping.



Even without a joint distribution of income and educational
background, it is possible to envision the properties of this
distribution after classification. With $Z_{(10)}$, mean incomes and
income ranges are approximately the same for the "none" through "7-8
years" groups. Hence, the income distributions of the groups from
$Z_{(10)}$ overlap a great deal. Individually, some of the groups
contribute little to the between-groups variance. In fact, collapsing the
five lowest groups into a single "0-8 years" group does not greatly change
the between-groups variance. So $Z_{(10)}$ acts rather like, say, a
$Z_{(6)}$.

$Z'_{(10)}$, on the other hand, has wide intervals at the lower end, a
relatively uniform distribution of observations, and large variation in
group means. It forms homogeneous income groups by adding groups at the
upper end and collapsing similar (in income) groups at the lower end.
We suspect that $Z'_{(10)}$ forms income groups which are more compact
(smaller within-group variation) and more distinct (less overlap among
groups) than those from $Z_{(10)}$. If so, this combination is sufficient
to ensure that the between-group variance in income will be greater with
$Z'_{(10)}$, and the grouped estimator more efficient.

In general, classifications which yield small within-group
variation in the independent variable are preferred. This type of
classification decreases the pooled within-group variation and thus
increases between-group variation.

The effects of overlap of the within-group distributions of the
independent variable operate similarly. As the overlap among
distributions decreases, grouping more closely resembles direct
stratification on X, which is optimally efficient.
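
The decomposition behind these statements can be made concrete. The
sketch below (Python; the income figures and both classifications are
invented for illustration and are not data from this study, and the
function names are our own) splits the total sum of squares of X into
between-group and within-group parts for two groupings with the same
number of classes.

```python
def sums_of_squares(x, labels):
    """Return (between, within, total) sums of squares of x for a grouping."""
    grand_mean = sum(x) / len(x)
    groups = {}
    for value, label in zip(x, labels):
        groups.setdefault(label, []).append(value)
    between = 0.0
    within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        between += len(values) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in values)
    total = sum((v - grand_mean) ** 2 for v in x)
    return between, within, total


def efficiency(x, labels):
    """Between-group share of the total variation in x (SS_B / SS_T)."""
    between, _, total = sums_of_squares(x, labels)
    return between / total


# Hypothetical incomes, already ordered from lowest to highest.
income = [1, 2, 3, 4, 5, 6, 7, 8]
# Four compact classes: neighbouring incomes fall in the same group.
compact = ["a", "a", "b", "b", "c", "c", "d", "d"]
# Four overlapping classes: each group spans low and high incomes.
overlapping = ["a", "b", "c", "d", "a", "b", "c", "d"]

print(efficiency(income, compact), efficiency(income, overlapping))
```

Both groupings use four classes of two observations each, yet the compact
classification retains 40/42 of the variation in income while the
overlapping one retains only 10/42.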
D. Summary

The within-variable factors that affect estimation are
interdependent and tend to constrain each other. Insofar as finer
breakdowns increase the relation between the grouping variable and the
independent variable, information loss declines and precision increases.
If the characteristic is judiciously chosen, the investigator can quickly
arrive at a grouping which balances the competing factors and yields
estimates which suit his purposes.

II. Scales of Measurement: Nominal Grouping Characteristics

So far, we have treated the grouping characteristic as if it has
at least an interval scale and thus has specifiable linear relations
with the dependent and independent variables. The next step is to
consider grouping characteristics that have nominal scales.

Sound procedures for predicting the effects of a nominal grouping
characteristic are urgently needed in educational research. Cross-level
inferences from aggregate sampling units such as schools occur
frequently; careful examination of the consequences is needed.
Unfortunately, the sociological methods developed to date are often
complex, and some apply primarily to relations among unordered variables
(Goodman, 1959; Iversen, 1973).

Our approach is to try to fit structural-equation methods to this
case. We shall incorporate the nominal grouping characteristics into
the model as we incorporated ordered characteristics. Two schemes for
incorporating the nominal grouping characteristic are discussed below.
Wiley (personal communication) actually offers a new conceptual schema
for analyzing the grouping process. The other approach, the creation of
multiple dichotomies to represent the nominal characteristic, adapts a
familiar econometric technique.

A. Categorization by Scale and Type of Variable

To this point we have considered only the manifest relations of
grouping variables to the other study variables. We have not attempted
to describe the latent forces that underlie the grouping of observations.
When the manifest grouping characteristic has a nominal scale, a more
careful examination of the classification process may prove useful.
Classification procedures such as latent structure analysis have been
discussed in this context. We consider here the implications of a
procedure suggested by Wiley for aggregation problems.

1. The Classification Matrix

Wiley's scheme for classifying grouping variables is a variation
of the model represented by the structural equations [3.14a] and [3.14b]
and by the path diagrams in Figure 3.1. Additionally, however, (1) each
Z is now said to be either a "fixed" or a "random" variable, and (2)
attention is now paid to whether it has a nominal or an interval scale.

Before, a grouping variable was spoken of as random if the
individual observations were randomly allotted to groups. Here, Z is
considered a random variable if the groups of Z are randomly sampled
from some broader population. Z thus operates like a random factor
in the analysis of variance, as opposed to a fixed factor. Randomness
is a property of the selection of groups, not of the assignment of
observations to the groups.

To clarify Wiley's scheme, consider the following hypothetical
data set. Suppose that data on the following grouping variables were
collected in an international study of the relation of home environment
to mathematics achievement: the sex of students, the nation, the
classroom, the school, the school size, student mathematical aptitude,
and the salary of the student's math teacher.

We can classify each variable within a scale by type-of-variable
grid. The nominal vs. interval dichotomy is relatively straightforward.
In their present form, school size, math aptitude, and teacher salary
are the variables with interval scales.

Classification by type of variable requires more thought. It is
likely that the classrooms in the study are important only as
"representatives" of similar entities. Therefore, we can treat the
classrooms as random samples from some larger population of classrooms.

Examining each grouping variable in the same manner leads to
classification Matrix A:

Matrix A

                FIXED               RANDOM

NOMINAL         Sex                 Classroom
                Nation              School

INTERVAL        School Size
                Math Aptitude
                Teacher Salary

2. Manifest vs. Latent Grouping

Wiley argues that, in general, grouping characteristics like school
and classroom are surrogates for some unmeasured variables which have
interval scales. (Without loss of generality, we assume there is only
one unmeasured variable.) In other words, there exists some underlying
interval variable, the latent Z, which determines group membership when
observations are manifestly grouped by a nominal variable. In our present
example, this might mean that nation is really a proxy for, say,
national commitment to education; then grouping by nation would
approximate grouping by national commitment to education (as measured on
an interval scale).

We can illustrate the interrelation of the manifest and the latent Z
by incorporating both in the path diagram. This model is presented in
Figure 4.1. When the data are grouped by the manifest Z, Figure 4.2
represents the aggregate path diagram corresponding to Figure 4.1.

Given these path models the investigation properly focuses on the 

conditions under which y +TY + XX . The question to be 

I 2 3 1 2 3 

answered is "Does grouping by the manifest Z affect the latent Z in a way
that will change the relation of X to Y?" If the answer is yes, then
grouping by nationality yields biased estimates.

The latent Z cannot be directly measured. It is a latent variable
analogous to the latent traits of factor analytic models. However, values
of the latent Z can be estimated by D(Z), a discriminant function
describing the differences in the classes of the manifest Z with respect
to variables potentially influencing the grouping process.

In the example above, national commitment to education is the
latent variable represented by nation. Substantial auxiliary
information, such as per pupil expenditures ($W_1$), educational
expenditure as a proportion of national GNP ($W_2$), and proportion of
children enrolled in school at, say, age 15 ($W_3$), is needed for there
to be any hope that D(Z) generates good estimates of the latent Z values.
The equation representing this relation would be

    National Commitment to Education = D(Nation) + $\delta$

                                     = $\beta_1 W_1 + \beta_2 W_2 + \beta_3 W_3 + \delta$

where the $\beta$'s are the variable weights in the discriminant function
and $\delta$ represents unaccountable differences in national commitment.
$\delta$ must approach zero if the grouped estimate is to be unbiased.

Figure 4.1. Path diagram incorporating both latent and manifest
grouping variables. (Diagram not reproduced.) Legend:

    Z (manifest):  manifest (or measured) grouping variable
    Z (latent):    latent (or unmeasured) grouping variable
    Y:             dependent variable
    X:             independent variable
    v, w:          disturbance terms for X and Y
    T:             disturbance term for Z
    Arrow labels:  structural parameters for the relations designated
                   by the corresponding arrows

Figure 4.2. Path diagram for aggregate data grouped by Z. (Diagram not
reproduced.) Legend:

    X:             aggregate independent variable (group means based on Z)
    Y:             aggregate dependent variable (group means based on Z)
    Z (latent):    aggregate latent grouping variable
    v, w:          structural parameters for aggregate X and Y
    Arrow labels:  structural parameters for the aggregate relations
                   designated by the corresponding arrows

This is a minimum condition for maintaining consistent relations among
X, Y, and the latent Z at the individual and group levels. Otherwise, the
influence of $\delta$, which has an effect on X and Y independent of
Z, will change between levels.

Returning to the hypothetical data, we can conceivably estimate the
latent Z's of the particular classrooms, schools, and nations. In fact,
all the nominal variables can be handled in this way. If so, then the new
classification matrix would be



Matrix B

                FIXED               RANDOM

NOMINAL

INTERVAL        School Size         D(Classroom)
                Math Aptitude       D(School)
                Teacher Salary
                D(Nation)
                D(Sex)

3. Evaluation of the Wiley Classification Scheme

According to Wiley's scheme, we can always generate an interval
grouping variable if enough information is available. The investigator
cannot translate his knowledge of the underlying grouping variable into
an ordered function without resorting to classification procedures of
this sort.

At the same time, however, the search for an underlying grouping
variable greatly complicates the procedure for choosing Z. Where
before only estimates of $\beta_{YX}$, $\beta_{YZ \cdot X}$, and
$\beta_{XZ}$ were needed, we must now find the underlying latent Z.
Besides, we still have to determine optimal class intervals (with respect
to within-variable factors) for D(Z) after the variable has been
generated.

The benefits from estimating D(Z) are derived mainly from the
uncovering of the inherent causal patterns among the grouping variables
which affect the estimation of $\beta_{YX}$. If the investigator's efforts
are directed toward "purity" in aggregation and more accurate
specification of the models, Wiley's methods can be useful. It makes
little sense, on the other hand, to estimate the latent Z solely for the
purpose of having an interval grouping variable.

The type-of-variable distinction raises serious questions about
the process of grouping. If the classes of the grouping variable are
fixed, then there is no change in the conceptualization of grouping
effects. If, on the other hand, the classes are random, the original
observations should be treated as a single- or two-stage cluster sample
rather than as a simple random sample for the purposes of grouping.
In cluster sampling, the selected clusters (individual classrooms, for
example) are a simple random sample from the population of clusters, and
sampling within the clusters is also random.

The distinction between cluster and simple random sampling
apparently has not been made before in the context of grouping. The
usual regression analyses start with the assumption that the data are
a simple random sample. We do not find fault with this assumption for
the ungrouped observations or for a fixed number of groups. The
sampling properties of the data become an issue only after grouping.
The question then arises as to whether the classes of Z can be
considered a simple random sample, since the classes become the units for
analysis. An unbiased estimate is impossible if the groups themselves
are a non-random sample, whether the units are the original
observations or the weighted group means.

B. Dummy Coding

Economists generally employ dummy coding methods to incorporate
nominal characteristics in their models. This procedure is less
complex than Wiley's and may prove fruitful for our purposes.

In applying dummy coding, we represent any nominal characteristic
with m groups by m-1 (or m, depending on the computer program)
dichotomous dummy variables in the basic structural equations.
Equations [3.14a] and [3.14b], which incorporate the grouping
characteristic, become

[4.3a]   $$Y = \alpha + \beta_{YX \cdot Z_1 \ldots Z_{m-1}} X
              + \beta_{YZ_1 \cdot X Z_2 \ldots Z_{m-1}} Z_1 + \cdots
              + \beta_{YZ_{m-1} \cdot X Z_1 \ldots Z_{m-2}} Z_{m-1} + w$$

[4.3b]   $$X = \alpha' + \beta_{XZ_1 \cdot Z_2 \ldots Z_{m-1}} Z_1 + \cdots
              + \beta_{XZ_{m-1} \cdot Z_1 \ldots Z_{m-2}} Z_{m-1} + v$$

where the $Z_i$, $i = 1, \ldots, m-1$, are the dichotomous variables
representing group membership, and the
$\beta_{YZ_i \cdot X Z_1 \ldots Z_{i-1} Z_{i+1} \ldots Z_{m-1}}$ and the
$\beta_{XZ_i \cdot Z_1 \ldots Z_{i-1} Z_{i+1} \ldots Z_{m-1}}$ are the
structural parameters in the regressions with Y and X, respectively.

Then, if $r^2_{YX}$ is the squared correlation coefficient of X and
Y, and $R^2_{Y \cdot X Z_1 \ldots Z_{m-1}}$ and
$R^2_{X \cdot Z_1 \ldots Z_{m-1}}$ are the squared multiple
correlation coefficients from incorporating the dichotomous regressors
based on Z, then the direct strength of the relation of Y to Z can
be estimated from the square root of the variation accounted for by
incorporating the dummy variables. That is, we estimate
$\beta_{YZ \cdot X}$ from

         $$\sqrt{R^2_{Y \cdot X Z_1 \ldots Z_{m-1}} - r^2_{YX}} .$$

The relation of X to Z ($\beta_{XZ}$) can be estimated from

         $$\sqrt{R^2_{X \cdot Z_1 \ldots Z_{m-1}}} .$$

This estimation procedure requires some justification. The reason
for the use of the square root of the variation accounted for is to
have units comparable to the standardized regression coefficients from
incorporating interval grouping variables. The "additional variation
accounted for" notion embodied in our suggested estimator of
$\beta_{YZ \cdot X}$ is an attempt to identify any relationship between Y
and the Z's that is masked in the simple linear model for ungrouped
observations (Equation [3.1]). The estimator suggested for $\beta_{XZ}$
provides an indication of the magnitude of the relation between X and the
Z's (the correlation ratio $\eta$ would also fulfill this function). In
this way, we hope to make direct comparisons of the effects of nominal
grouping characteristics with the effects of interval characteristics.
For this reason alone, the dummy coding strategy provides a viable
alternative to the classification procedures which necessitate a
search for the latent causes of group membership.*
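
The two indices can be computed without building the dummy columns
explicitly. The Python sketch below is an illustration under stated
assumptions, not a procedure taken from the sources cited: it assumes a
constant plus a full set of m-1 dummies, so that the regression of X on
the dummies reproduces the group means of X, and the regression of Y on X
and the dummies is a common-slope fit to within-group deviations. The
function names and the four data points are hypothetical.

```python
from math import sqrt


def _center(values):
    """Deviations from the grand mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _center_within(values, labels):
    """Deviations of each observation from its own group mean."""
    sums, counts = {}, {}
    for v, g in zip(values, labels):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, labels)]


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def dummy_coding_indices(y, x, labels):
    """Return the suggested estimates of beta_YZ.X and beta_XZ.

    Requires some within-group variation in x (otherwise X is an exact
    function of the dummies and the common slope is undefined).
    """
    xt, yt = _center(x), _center(y)
    xw, yw = _center_within(x, labels), _center_within(y, labels)
    # Squared simple correlation of X and Y.
    r2_yx = _dot(xt, yt) ** 2 / (_dot(xt, xt) * _dot(yt, yt))
    # R^2 of X on the dummies: the between-group share of X.
    r2_x_z = 1.0 - _dot(xw, xw) / _dot(xt, xt)
    # R^2 of Y on X plus dummies: residual SS from the common-slope fit.
    sse = _dot(yw, yw) - _dot(xw, yw) ** 2 / _dot(xw, xw)
    r2_y_xz = 1.0 - sse / _dot(yt, yt)
    return {
        "yz_given_x": sqrt(max(r2_y_xz - r2_yx, 0.0)),
        "xz": sqrt(max(r2_x_z, 0.0)),
    }


# Hypothetical data: two groups with identical X values, Y shifted by group.
x = [1.0, 2.0, 1.0, 2.0]
y = [1.0, 2.0, 3.0, 4.0]
labels = ["g1", "g1", "g2", "g2"]
indices = dummy_coding_indices(y, x, labels)
print(indices)
```

In this constructed case X is unrelated to group membership, so the index
for $\beta_{XZ}$ is zero, while the dummies account for the 80 percent of
the variation in Y that X alone leaves unexplained.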
C. Summary

Neither Wiley's scheme nor the dummy coding approach yields perfect
indices of the relations of a nominal Z to X and Y, but both warrant
further consideration as alternatives to those previously proposed. They
at least provide a starting point for refining the "structural equations"
approach in the nominal case.



*Werts and Linn (1971) discuss the regression analysis for "compositional
effects", which involves the incorporation of the group mean
$\bar{X}_{.j}$, rather than Z, in the simple model. Using $\bar{X}_{.j}$
instead of Z in the modified structure has the advantage of ensuring that
the grouping mechanism is represented by an ordered variable, regardless
of the scale of the grouping characteristic. However, with multiple
regressors, this strategy can become cumbersome rapidly unless one
incorporates, say, the values from the best linear discriminant function
(discriminating among the Z values as a function of the X's). Still, the
Werts-Linn method deserves more consideration than we have given it here.
CHAPTER 5

PRELIMINARY NOTES ON THE MULTIVARIATE CASE

Our findings on the effects of grouping in the bivariate case can
be extended to the multivariate case. Problems caused by correlated
regressors, however, can complicate the interpretation of grouping
effects. These problems are considered below.

We begin by reviewing previous work on the multivariate case,
considering papers by Prais and Aitchinson (1954), Haitovsky (1966;
1973), and Feige and Watts (1972). To simplify our own discussion,
we analyze the three-variable case where Y is regressed on just two
independent variables, X and W. The grouping variable Z enters
as a fourth variable. The parameters to be estimated are the
regression coefficients $\beta_{YX \cdot W}$ and $\beta_{YW \cdot X}$.
The conclusions are generalizable to any number of regressors.

The earlier taxonomy is expanded to consider the interrelation of
Y, X, W, and Z for a specific causal ordering of X and W.
This taxonomy is used to investigate the bias in estimating the
regression coefficients.

I. Previous Work on the Multivariate Case

Whereas univariate prediction with grouped data has been considered
by persons from several social science disciplines, the treatment of
grouping effects with multiple predictors has remained purely in the
domain of the econometricians. Prais and Aitchinson (1954) seemingly
stood alone until Haitovsky (1966) suggested that grouping can indeed
cause bias in the multivariate case. Feige and Watts (1972),
apparently unfamiliar with Haitovsky's work, raised much the same
question. Below we attempt to reconcile the conclusions of
Prais-Aitchinson, Haitovsky, and Feige-Watts.
A. Transformation by a Grouping Matrix: Prais and Aitchinson

Prais and Aitchinson (1954) derived formulas for grouped
estimation in the multivariate case. They employed matrix notation
throughout.

Consider the usual postulated model for multiple linear regression

[5.1]    $$Y = X\beta + u ,$$

where Y, X, $\beta$, and u are matrices of orders N x 1, N x k,
k x 1, and N x 1, respectively. We assume that the rank of X is k,
the number of regressors, where k is less than or equal to the number
of persons N.

An estimate of $\beta$ can be found by the principle of least squares
(LS). The assumptions in the multivariate case are analogous to those
of the model (equation 3.1). They are as follows:

B1. The X are fixed or else the X are random variables with
    joint distribution independent of u.

B2. $E(u) = 0$.

B3. $V(u) = E(uu') = \sigma^2 I_N$, where $I_N$ is the identity matrix
    of order N.

B4. X is of rank k.

The principle of least squares provides an estimator of $\beta$ that
minimizes the sum of squares of deviations of Y from $\hat{Y}$. This
estimator is given by

[5.2]    $$b = (X'X)^{-1}X'Y .$$

b is an unbiased estimator of $\beta$. If the u are normally distributed
then b is also a maximum likelihood estimator (MLE).
The covariance matrix of the vector b is

[5.3]    $$\mathrm{cov}(b) = E[(b-\beta)(b-\beta)'] = \sigma^2 (X'X)^{-1} ,$$

and the residuals $\hat{u} = Y - Xb$ are linear functions of the
disturbances u.

Prais and Aitchinson next introduced an m by N grouping
matrix G which maps the original observations into their appropriate
groups and weights each group by the number of observations included.
Thus G is a weighting matrix in which the weights are determined by
the number of observations in the various groups. The value in the ith
row of G is $1/n_i$ for persons belonging to group i and 0 for
persons not in group i. For example, with five observations divided
into two groups (m = 2) with the first, third, and fourth in the
first group and the second and fifth in the other, the weighting matrix
is

         $$G = \begin{bmatrix} 1/3 & 0 & 1/3 & 1/3 & 0 \\ 0 & 1/2 & 0 & 0 & 1/2 \end{bmatrix} .$$

Note that

         $$GG' = \begin{bmatrix} 1/3 & 0 \\ 0 & 1/2 \end{bmatrix}$$

and

         $$(GG')^{-1} = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} .$$

That is, the diagonal elements of the inverse of GG' indicate the
number of observations per group.
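
The five-observation example can be reproduced mechanically. The
following Python sketch uses exact fractions; the helper names
`grouping_matrix` and `times_transpose` are our own and are not part of
the Prais-Aitchinson formulation.

```python
from fractions import Fraction


def grouping_matrix(labels):
    """Build the m-by-N matrix G: row i holds 1/n_i for members of group i, else 0."""
    groups = list(dict.fromkeys(labels))  # groups in order of first appearance
    counts = {g: labels.count(g) for g in groups}
    return [
        [Fraction(1, counts[g]) if lab == g else Fraction(0) for lab in labels]
        for g in groups
    ]


def times_transpose(a):
    """Return A A' for a matrix stored as a list of rows."""
    return [[sum(p * q for p, q in zip(r1, r2)) for r2 in a] for r1 in a]


# Observations 1, 3 and 4 belong to the first group; 2 and 5 to the second.
labels = [1, 2, 1, 1, 2]
G = grouping_matrix(labels)
GGt = times_transpose(G)
# GG' is diagonal, so its inverse is the matrix of reciprocal diagonal entries.
group_sizes = [1 / GGt[i][i] for i in range(len(GGt))]

print(G)
print(GGt)
print(group_sizes)
```

GG' is diagonal because no observation belongs to two groups, so the rows
of G never have nonzero entries in the same column.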

The regression model for the grouped data is then found by
premultiplying [5.1] by G to get

         $$GY = GX\beta + Gu .$$
Since G is a weighting matrix, it gives us means, i.e., $GY = \bar{Y}$,
$GX = \bar{X}$, and $Gu = \bar{u}$. So we obtain

[5.4]    $$\bar{Y} = \bar{X}\beta + \bar{u} .$$

By assuming that the number of groups formed, m, is greater than
or equal to the number of parameters estimated, it follows that GX
is of rank k. Consequently, the assumptions B1-B4 apply to the
model [5.4] where

         $$E(\bar{u}) = 0$$

and

         $$V(\bar{u}) = \sigma^2 GG' .$$

Under these conditions, the grouped estimator B for $\beta$ is

[5.5]    $$B = [\bar{X}'(GG')^{-1}\bar{X}]^{-1}\bar{X}'(GG')^{-1}\bar{Y}
             = (X'HX)^{-1}X'HY$$

where

         $$H = G'(GG')^{-1}G .$$

For X fixed (or for X random, because of assumption B1 and the fact
that $E(Y) = X\beta$),

         $$E(B) = [(X'HX)^{-1}X'H]E(Y) = (X'HX)^{-1}X'HX\beta = \beta .$$

The covariance matrix of B is given by

[5.6]    $$\mathrm{cov}(B) = \sigma^2[\bar{X}'(GG')^{-1}\bar{X}]^{-1}
                           = \sigma^2 (X'HX)^{-1} .$$
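
The algebra of the unbiasedness argument can be illustrated with
error-free data: if $Y = X\beta$ exactly, the grouped estimator returns
$\beta$ whatever the grouping, provided the group means of X are not all
equal. The Python sketch below handles one regressor plus a constant and
uses the fact that $(GG')^{-1}$ is the diagonal matrix of group sizes. The
function name, the data, and the grouping are hypothetical.

```python
from fractions import Fraction


def grouped_estimator(x, y, labels):
    """Return (intercept, slope) from least squares on weighted group means.

    This is B = [Xbar'(GG')^{-1}Xbar]^{-1} Xbar'(GG')^{-1}Ybar for one
    regressor plus a constant, with (GG')^{-1} = diag(n_1, ..., n_m).
    """
    groups = list(dict.fromkeys(labels))  # groups in order of first appearance
    sizes, xbar, ybar = [], [], []
    for g in groups:
        members = [i for i, lab in enumerate(labels) if lab == g]
        sizes.append(len(members))
        xbar.append(Fraction(sum(x[i] for i in members)) / len(members))
        ybar.append(Fraction(sum(y[i] for i in members)) / len(members))
    # Entries of Xbar' W Xbar and Xbar' W Ybar with W = diag(sizes).
    s_n = sum(sizes)
    s_x = sum(n * a for n, a in zip(sizes, xbar))
    s_y = sum(n * b for n, b in zip(sizes, ybar))
    s_xx = sum(n * a * a for n, a in zip(sizes, xbar))
    s_xy = sum(n * a * b for n, a, b in zip(sizes, xbar, ybar))
    det = s_n * s_xx - s_x ** 2  # zero if all group means of x coincide
    slope = (s_n * s_xy - s_x * s_y) / det
    intercept = (s_y - slope * s_x) / s_n
    return intercept, slope


x = [1, 2, 3, 4, 5, 6]
y = [2 + 3 * v for v in x]               # error-free: intercept 2, slope 3
labels = ["a", "b", "a", "b", "c", "c"]  # an arbitrary grouping
print(grouped_estimator(x, y, labels))
```

With disturbances added to Y the grouped estimate would vary from sample
to sample; the example isolates the step $(X'HX)^{-1}X'HX\beta = \beta$ in
the expectation above.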
Prais and Aitchinson concluded that "whatever the method of
grouping, the resulting estimators will always be unbiased" (1954, p. 1).
But this contradicts the results of Chapter 3 for grouping in the single
regressor case. The work by Haitovsky (1966; 1973) and by Feige and
Watts (1972) and the new material in Sections II and III of this chapter
identify limitations of their formulation which led them into difficulty.

Prais and Aitchinson also provided an overall measure of the
efficiency of the method of grouping:

[5.7]    $$\mathrm{Eff}(b,B) = \frac{\operatorname{tr}[(X'X)^{-1}]}{\operatorname{tr}[(X'HX)^{-1}]} ,$$

the ratio of the sum of the diagonal elements from the covariance matrix
of b to the corresponding sum from the covariance matrix of B. In
the single-regressor case with X fixed, their efficiency formula
simplifies to become the ratio of the between-group sum of squares to
the total sum of squares, the equivalent of Cramer's formula (see page
47). When there is no bias from grouping, this measure of efficiency is
appropriate.
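
The reduction to Cramer's ratio in the single-regressor case can be seen
directly. The short derivation below is a sketch that assumes X is a
single column of deviations from the grand mean (so that no constant term
is needed) and takes the efficiency measure as the ratio of traces just
described; $\bar{x}_i$ and $n_i$ denote the mean deviation and the size of
group i, which is our notation rather than that of Prais and Aitchinson.

```latex
X'X = \sum_{j=1}^{N} x_j^2 = SS_T(X), \qquad
X'HX = (GX)'(GG')^{-1}(GX) = \sum_{i=1}^{m} n_i \bar{x}_i^{\,2} = SS_B(X),

\mathrm{Eff}(b,B)
  = \frac{\operatorname{tr}[(X'X)^{-1}]}{\operatorname{tr}[(X'HX)^{-1}]}
  = \frac{1/SS_T(X)}{1/SS_B(X)}
  = \frac{SS_B(X)}{SS_T(X)} .
```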

B. Estimates from Classification Data: Haitovsky

Haitovsky (1966; 1973) called into question the conclusion of Prais
and Aitchinson. He demonstrated problems that arise when the regressor
data are in the form of one-way classification tables, with frequencies
of the cross-classifications unknown. According to Haitovsky, grouping
on one independent variable can lead to biased estimates of the
multiple regression coefficients in this situation.

Haitovsky analyzed data from a study by Houthakker and Haldi (n.d.)
to illustrate his conclusions. In the Houthakker-Haldi study,
automobile purchases (Y) were regressed on individual income (X) and
initial automobile inventory (W). Haitovsky grouped observations on
X and W separately as well as on the cross-classification of X
and W. His estimates for $\beta_{YX \cdot W}$ and $\beta_{YW \cdot X}$
are presented in Table 5.1.

The estimates from the cross-classification were fairly accurate.
The single-variable classifications yielded estimates with huge standard
errors. If 7 or 8 groups had been formed randomly, we would have
expected the standard errors to be even larger.

Haitovsky failed to note an interesting trend in the data. When
the observations were grouped by one regressor, say, X, its regression
coefficient $\beta_{YX \cdot W}$ was better estimated, in terms of smaller
bias and standard errors, than was the coefficient $\beta_{YW \cdot X}$ of
the other regressor. That is, grouping on a regressor affected the
estimate of its coefficient less than it did the estimate of the
coefficients for other variables.

As Hannan (1972) put it, Haitovsky's paper showed that "in the
multivariate model, grouping by some concrete criterion which
approximates grouping systematically by a subset of the regressors
can produce appreciable bias" (p. 33). Hannan also pointed out that
the bias Haitovsky described is essentially specification bias. That
is, bias arises through the failure to include all correlated regressor
variables in the data analysis.

According to Haitovsky and Hannan, unbiased estimates are
obtainable if the investigator groups on all regressor variables jointly.
But with a large number of independent variables each having several
classifications, grouping on all jointly is impractical. Unless other
evidence is forthcoming, it is easy to agree with Hannan's conclusion
that the analyst must have a good deal of confidence in the substantive
aspects of his model before concluding that any grouping procedure is
"optimal".

Table 5.1. Estimates of regression coefficients and standard errors
with alternative grouping methods from the Houthakker-Haldi study.*

                              Number of
Grouping Method               Groups (m)   $\beta_{YX \cdot W}$   $\beta_{YW \cdot X}$
---------------------------   ----------   --------   --------
Ungrouped data                   1218        .758      -.178
                                            (.1398)    (.0357)
Income (X) x Inventory (W)         56        .747      -.162
                                            (.1203)    (.0323)
Income (X) only                     7        .551       .038
                                           (1.6139)   (1.9752)
Inventory (W) only                  8       -.653      -.093
                                           (2.5391)    (.1572)

*The numbers in parentheses are the estimated standard errors of the
corresponding estimates.


C. Aggregating Data to Preserve Confidentiality: Feige and Watts

Feige and Watts (1970; 1972) considered the feasibility of data
aggregation as a means of preserving the confidentiality of data. They
developed statistics for evaluating the loss of information from
grouping in this context. One measure indicates the degree of
divergence between estimates from grouped and ungrouped data, and the
other indicates the loss of efficiency. Feige and Watts applied a variety
of grouping procedures to a large data set and assessed the resulting
parameter estimates.

According to Feige and Watts, differences between the ungrouped
and grouped estimators may be composed of (i) specification bias, (ii)
bias introduced by a grouping that is not independent of the disturbances,
or (iii) sampling error induced by the loss of information in grouping.
Their second source is most pertinent to our discussion since it
suggests that even when the regressors and disturbances are independent at
the individual level, bias can still result when the grouping matrix
G is not independent of the stochastic disturbance u (see pp. 51-52).

When their description of bias from grouping is translated into
more familiar terminology, we find that Feige and Watts actually
described the case previously discussed by Blalock (1964) and Hannan
(1970; 1971) where the regressand is the basis for group
classification. In this case, since Y is a linear function of u, grouping
on Y ensures that H and u are not independent, and thus the estimate
from the Y-on-X regression is biased (see pp. 51-52 for a summary of
Blalock's reasoning).

The problem of gauging the magnitude of the divergence remains if
the analyst is to systematically choose among alternative grouping
methods. The Feige-Watts measure of divergence is based on the
difference between b and B. We summarize the Feige-Watts analysis below,
following the Prais-Aitchinson notation and transformation procedures
for generating the model at the group level. Relevant equations from
our discussion of Prais and Aitchinson are repeated for clarity.

Equation [5.1] with its accompanying assumptions is again the
basic model for the ungrouped observations. We have

[5.2]    $b = (X'X)^{-1}X'Y$

and

[5.3]    $\operatorname{cov}(b) = \sigma^2 (X'X)^{-1}$ .

A grouping matrix G transforms the raw data to a set of m rows;
the ith row contains the mean values of the variables for the ith group.
I.e., the matrix [Y, X] is replaced by

         $[\bar{Y}, \bar{X}] = G[Y, X]$ .

Recall that

         $H = G'(GG')^{-1}G$ .

Hence, the estimates of $\beta$ and its covariance matrix from grouped data
can be written as

[5.5]    $B = (X'HX)^{-1}X'HY$

and

[5.6]    $\operatorname{cov}(B) = \sigma^2 (X'HX)^{-1}$ .
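As a rough illustration of [5.2] and [5.5], the sketch below builds a grouping matrix for made-up data and computes both estimators. The data, the random seed, the group labels, and the helper names `grouping_matrix` and `ungrouped_and_grouped_ols` are assumptions introduced for the example; G is taken to have one row per group, with entries 1/n_i for the members of group i, so that GX holds the group means.

```python
import numpy as np

def grouping_matrix(labels):
    """Return the m x N matrix G whose i-th row averages the members of group i."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    G = np.zeros((groups.size, labels.size))
    for i, g in enumerate(groups):
        members = labels == g
        G[i, members] = 1.0 / members.sum()
    return G

def ungrouped_and_grouped_ols(X, Y, labels):
    """Compute b as in [5.2] and B as in [5.5] for one grouping of the rows."""
    G = grouping_matrix(labels)
    H = G.T @ np.linalg.inv(G @ G.T) @ G          # H = G'(GG')^{-1}G
    b = np.linalg.solve(X.T @ X, X.T @ Y)         # ungrouped estimator
    B = np.linalg.solve(X.T @ H @ X, X.T @ H @ Y) # grouped estimator
    return b, B, H

# Hypothetical data: one regressor plus a constant, 30 randomly formed groups.
rng = np.random.default_rng(0)
N = 600
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
Y = 1.0 + 2.0 * x + rng.normal(size=N)
labels = rng.integers(0, 30, size=N)   # grouping unrelated to the disturbances
b, B, H = ungrouped_and_grouped_ols(X, Y, labels)
print("ungrouped b:", b)
print("grouped   B:", B)
```

With G defined this way, H is a symmetric idempotent matrix, and B coincides with least squares on the group means weighted by group size.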

The divergence between grouped and ungrouped estimates of $\beta$,

         $\Delta(H) = b - B$ ,

has a zero mean and variance-covariance matrix equal to

         $\sigma^2[(X'HX)^{-1} - (X'X)^{-1}]$ .

Let $\bar{e} = \bar{Y} - \bar{X}B$ so that $\bar{e}'\bar{e}$ is the sum of squared residuals
from the between-groups regression. Assume additionally that the
disturbances u are normally distributed. Then, according to Feige
and Watts, the quadratic forms

         $Q_1 = \Delta(H)'[(X'HX)^{-1} - (X'X)^{-1}]^{-1}\Delta(H) / \sigma^2$

and

         $Q_2 = \bar{e}'\bar{e} / \sigma^2$

are distributed as $\chi^2$ with k and m-k degrees of freedom, respectively.

Feige and Watts claim that if the model is correctly specified
and H and u are independent,

[5.8]    $\Gamma = \dfrac{Q_1/k}{Q_2/(m-k)}$

is distributed as F with k and m-k degrees of freedom. Values
of $\Gamma$ beyond the critical values of the F-distribution indicate
differences between estimators that cannot be attributed to sampling
error. They associate good grouping methods with small $\Gamma$ values.

The Feige-Watts efficiency criterion is similar to the one that
Prais and Aitchinson derived. (See Equation [5.7].) Feige and Watts
remove the influence of the constant term, as no information is lost
in estimating this parameter. Their efficiency measure is

[5.9]    $E = \dfrac{1}{k-1}\{\operatorname{tr}[(X'X)^{-1}X'HX] - 1\}$ ,

where $\operatorname{tr}[(X'X)^{-1}X'HX]$ is the sum of the diagonal elements of the
matrix whose entries are the ratios of between-group sums of squares
and cross-products to total sums of squares and cross-products. Thus
Feige and Watts also recommended forming groups homogeneous with
respect to the independent variables in the analysis to minimize loss
of efficiency.
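The recommendation to form groups that are homogeneous in the regressors can be seen in a small numerical sketch. It assumes the efficiency measure has the form (tr[(X'X)^{-1}X'HX] - 1)/(k - 1), with k counting the constant column; the data, the seed, and the function name `efficiency_index` are hypothetical.

```python
import numpy as np

def efficiency_index(X, labels):
    """Efficiency of a grouping: (tr[(X'X)^-1 X'HX] - 1) / (k - 1).

    X must contain a leading column of ones; k counts that column.
    """
    labels = np.asarray(labels)
    # HX replaces every row of X by the mean row of its group.
    HX = np.empty_like(X, dtype=float)
    for g in np.unique(labels):
        members = labels == g
        HX[members] = X[members].mean(axis=0)
    k = X.shape[1]
    trace = np.trace(np.linalg.solve(X.T @ X, X.T @ HX))
    return (trace - 1.0) / (k - 1)

# Hypothetical data: 300 observations, 10 groups of 30.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([np.ones_like(x), x])
random_groups = rng.integers(0, 10, size=300)        # random grouping
sorted_groups = np.argsort(np.argsort(x)) // 30      # groups homogeneous in x
e_random = efficiency_index(X, random_groups)
e_sorted = efficiency_index(X, sorted_groups)
print("random grouping:    ", round(e_random, 3))
print("grouping on ranked x:", round(e_sorted, 3))
```

For a single regressor the index reduces to the ratio of the between-group to the total sum of squares of X, which is why grouping on ranked values of X retains far more information than random grouping.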

To illustrate their findings, Feige and Watts examined twenty
regression equations generated from income and dividend information
provided by 5,393 banks to the Federal Reserve System. The seven
grouping rules they used included a random procedure and geographic
and financial asset indices. There were also three levels of aggregation:
slight (3 observations per group), moderate (30 observations),
and drastic (100 observations). Thus twenty-one grouping methods were
possible for each equation although the article only discussed a few.
Certain of the Feige-Watts equations were quite sensitive to the
choice of grouping rule and level of aggregation. The reported $\Gamma$
values ranged from .02 to 84.95. For one equation, all the $\Gamma$ values
were significant at the .05 level, while grouping produced no significant
results for other equations. The efficiency indices ranged from
.038 to .6895 with systematic grouping serving much better than random
grouping. In every case, slight aggregation was superior to moderate
or drastic aggregation in terms of bias and efficiency. Thus a large
number of groups again proved to be desirable.

Otherwise, the Feige-Watts examples demonstrate the tradeoff
between efficiency and bias. Random grouping is inefficient but unbiased.
Systematic grouping raises the likelihood of misspecification
and grouping bias, but improves efficiency.

It is worth noting that the test that Feige and Watts propose for
divergence (Equation [5.8]) may not be the most appropriate in this
instance. It may well be that the numerator and the denominator are
not independent of each other. Furthermore, there is an inherent asymmetry
in that the denominator is based solely on the aggregate residuals
whereas the numerator is a function of both ungrouped and aggregated
information.

The traditional F-test for differences in regression models takes
the form:

         $F = \dfrac{(R_F^2 - R_R^2)/(df_F - df_R)}{(1 - R_F^2)/(N - df_F)}$

where

    $R_F^2$ = squared multiple correlation for the so-called "full"
              model (the more inclusive model)

    $R_R^2$ = squared multiple correlation for the "restricted" model

and

    $df_F$, $df_R$ = degrees of freedom for the full and restricted
              models, respectively.
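The F ratio for comparing a full and a restricted model can be computed directly from the two squared multiple correlations. The sketch below assumes the usual form of the statistic, with the gain in R² per added parameter in the numerator and the unexplained variance of the full model per residual degree of freedom in the denominator. The function name and the example values are hypothetical, and the degrees of freedom are taken to be the numbers of estimated parameters.

```python
def f_ratio(r2_full, r2_restricted, df_full, df_restricted, n_obs):
    """F statistic for comparing a full and a restricted regression model.

    df_full and df_restricted are taken to be the numbers of estimated
    parameters in each model (an assumption about the notation).
    """
    numerator = (r2_full - r2_restricted) / (df_full - df_restricted)
    denominator = (1.0 - r2_full) / (n_obs - df_full)
    return numerator / denominator

# Hypothetical values: two extra parameters raise R^2 from .40 to .50 with N = 105.
example = f_ratio(r2_full=0.50, r2_restricted=0.40,
                  df_full=5, df_restricted=3, n_obs=105)
print(round(example, 2))
```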

There is no recognizable standard for interpreting the comparison of
individual-level and aggregate regression models in this fashion.
Intuitively, however, it is appealing to associate the individual-level
model with the "full" model above and the aggregate with the "restricted"
model. If this interpretation is defensible, then the residual sum of
squares from the individual-level regression ($e'e$, where $e = Y - Xb$)
would seem to be more appropriate than Feige and Watts' choice for the
denominator. This is a problem worth exploring further, but it is
outside the domain of the present inquiry.

II. The "Structural Equations" Approach in the Two-Predictor Case

The Haitovsky and Feige-Watts conclusions require elaboration since
neither presented a way to detect which subset of estimators is biased
by grouping. Our analysis of the multiple-regressor case departs from
the previous work. First, we specify the order of all variables; the
grouping variable is treated as prior to other variables to which it
relates. Secondly, each regression coefficient is considered separately.
This method, though more cumbersome than a matrix approach, enables
us to determine whether the relations of the grouping variable to the
regressors and regressand provide clues as to which subset of estimates
will exhibit bias. If this strategy works, we will be able to state
general principles for determining which estimates are biased for any
number of regressors.

We follow a procedure similar to that used in Section IV of
Chapter 3 with the bivariate case. A multiple-regression model with
two regressors (X and W) is modified by incorporating Z and by
specifying the structure among Y, X, W, and Z. This four-variable
structural model is then represented by simultaneous equations
describing the relations of Y to X, W, and Z, of X to W and
Z, and of W to Z.

Formulas for $\beta_{YX\cdot W}$ and $\beta_{YW\cdot X}$ are presented in terms of the
parameters of the structural equations at both individual and group
levels. The formulas are appropriate for the case when the sample
equals the population and under certain conditions for other sampling
designs. Any difference between coefficients from grouped and ungrouped
data is once more attributed to the effects of grouping.



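A. The Regression Equation with Two Regressors

The equation relating Y to X and W is

[5.10]   $Y = \alpha + \beta_{YX\cdot W}X + \beta_{YW\cdot X}W + u$ ,

where

[5.11a]  $\beta_{YX\cdot W} = \dfrac{\sigma_{YX}\sigma_W^2 - \sigma_{YW}\sigma_{XW}}{\sigma_X^2\sigma_W^2 - \sigma_{XW}^2}$

and

[5.11b]  $\beta_{YW\cdot X} = \dfrac{\sigma_{YW}\sigma_X^2 - \sigma_{YX}\sigma_{XW}}{\sigma_X^2\sigma_W^2 - \sigma_{XW}^2}$ .

Assumptions B1-B4 still apply so that u is independent of X
and W. The object of the investigation is to estimate $\beta_{YX\cdot W}$
and $\beta_{YW\cdot X}$ of equation [5.10] using grouped data.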

B. Modified Structure with Z Incorporated

The next step is to constrain the model by specifying a structure
among Y, X, W, and Z. As before (see page 53), we treat Z as
prior to Y, X, and W. We also assume that W is prior to X and
Y.

The path diagram of the structure is

[Path diagram: Z leads to W, X, and Y; W leads to X and Y; X leads to Y.
The disturbances $\epsilon_W$, $\epsilon_X$, and $\epsilon_Y$ enter W, X, and Y, respectively.]

In the diagram, $\epsilon_Y$ is the disturbance term representing all determiners
of Y not linearly related to X, W, and Z; $\epsilon_X$ represents all
determiners of X not linearly related to W and Z; and $\epsilon_W$ represents all
determiners of W not linearly related to Z. $\beta_{YX\cdot WZ}$, $\beta_{YW\cdot XZ}$,
$\beta_{YZ\cdot XW}$, $\beta_{XW\cdot Z}$, $\beta_{XZ\cdot W}$, and $\beta_{WZ}$ are regression coefficients.

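The structure generates these simultaneous equations:

[5.12a]  $Y = \alpha_Y + \beta_{YX\cdot WZ}X + \beta_{YW\cdot XZ}W + \beta_{YZ\cdot XW}Z + \epsilon_Y$ ,
[5.12b]  $X = \alpha_X + \beta_{XW\cdot Z}W + \beta_{XZ\cdot W}Z + \epsilon_X$ ,
[5.12c]  $W = \alpha_W + \beta_{WZ}Z + \epsilon_W$ .

Once again, $\beta_{YX\cdot WZ}$, $\beta_{YW\cdot XZ}$, $\beta_{YZ\cdot XW}$, $\beta_{XW\cdot Z}$, $\beta_{XZ\cdot W}$, and $\beta_{WZ}$ are
regression parameters; $\alpha_Y$, $\alpha_X$, and $\alpha_W$ are intercepts; and $\epsilon_Y$,
$\epsilon_X$, and $\epsilon_W$ are disturbance terms. $\epsilon_Y$ is assumed independent of X,
W, Z, $\epsilon_X$, and $\epsilon_W$; $\epsilon_X$ is assumed independent of W, Z, and $\epsilon_W$;
and $\epsilon_W$ is assumed independent of Z. We also assume that the disturbance
terms are homoscedastic and independent, as in the single-regressor
case.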

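Besides the intercepts, there are ten parameters: $\sigma_Z^2$, $\sigma_{\epsilon_W}^2$,
$\sigma_{\epsilon_X}^2$, $\sigma_{\epsilon_Y}^2$, and the six regression coefficients. Rewriting equations
[5.12a, b, c] in terms of these parameters, we have

[5.13a]  $Y = \alpha_Y + \beta_{YX\cdot WZ}[\alpha_X + \beta_{XW\cdot Z}(\alpha_W + \beta_{WZ}Z + \epsilon_W) + \beta_{XZ\cdot W}Z + \epsilon_X]$
             $+ \beta_{YW\cdot XZ}(\alpha_W + \beta_{WZ}Z + \epsilon_W) + \beta_{YZ\cdot XW}Z + \epsilon_Y$ ,

[5.13b]  $X = \alpha_X + \beta_{XW\cdot Z}(\alpha_W + \beta_{WZ}Z + \epsilon_W) + \beta_{XZ\cdot W}Z + \epsilon_X$ ,

[5.13c]  $W = \alpha_W + \beta_{WZ}Z + \epsilon_W$ .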

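Reduced-form expressions for variances and covariances are

[5.14a]  $\sigma_X^2 = (\beta_{XW\cdot Z}\beta_{WZ} + \beta_{XZ\cdot W})^2\sigma_Z^2 + \beta_{XW\cdot Z}^2\sigma_{\epsilon_W}^2 + \sigma_{\epsilon_X}^2$ ,

[5.14b]  $\sigma_{XW} = (\beta_{XW\cdot Z}\beta_{WZ}^2 + \beta_{XZ\cdot W}\beta_{WZ})\sigma_Z^2 + \beta_{XW\cdot Z}\sigma_{\epsilon_W}^2$ ,

[5.14c]  $\sigma_{YX} = (\beta_{YX\cdot WZ}\beta_{XW\cdot Z}\beta_{WZ} + \beta_{YX\cdot WZ}\beta_{XZ\cdot W} + \beta_{YW\cdot XZ}\beta_{WZ} + \beta_{YZ\cdot XW})(\beta_{XW\cdot Z}\beta_{WZ} + \beta_{XZ\cdot W})\sigma_Z^2$
             $+ (\beta_{YX\cdot WZ}\beta_{XW\cdot Z} + \beta_{YW\cdot XZ})\beta_{XW\cdot Z}\sigma_{\epsilon_W}^2 + \beta_{YX\cdot WZ}\sigma_{\epsilon_X}^2$ ,

[5.14d]  $\sigma_{YW} = (\beta_{YX\cdot WZ}\beta_{XZ\cdot W}\beta_{WZ} + \beta_{YX\cdot WZ}\beta_{XW\cdot Z}\beta_{WZ}^2 + \beta_{YW\cdot XZ}\beta_{WZ}^2 + \beta_{YZ\cdot XW}\beta_{WZ})\sigma_Z^2$
             $+ (\beta_{YX\cdot WZ}\beta_{XW\cdot Z} + \beta_{YW\cdot XZ})\sigma_{\epsilon_W}^2$ ,

with $\sigma_W^2 = \beta_{WZ}^2\sigma_Z^2 + \sigma_{\epsilon_W}^2$ following from [5.13c].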

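The reduced-form equations and variance-covariance expressions can
be used to derive equations stating $\beta_{YX\cdot W}$ and $\beta_{YW\cdot X}$ in terms of the
known parameters. By substitution and rearrangement, we arrive at the
desired equations:

[5.15a]  $\beta_{YX\cdot W} = \beta_{YX\cdot WZ} + \beta_{YZ\cdot XW}\beta_{XZ\cdot W}\dfrac{\sigma_Z^2\sigma_{\epsilon_W}^2}{\beta_{XZ\cdot W}^2\sigma_Z^2\sigma_{\epsilon_W}^2 + \beta_{WZ}^2\sigma_Z^2\sigma_{\epsilon_X}^2 + \sigma_{\epsilon_W}^2\sigma_{\epsilon_X}^2}$

and

[5.15b]  $\beta_{YW\cdot X} = \beta_{YW\cdot XZ} + \beta_{YZ\cdot XW}\dfrac{\beta_{WZ}\sigma_Z^2\sigma_{\epsilon_X}^2 - \beta_{XZ\cdot W}\beta_{XW\cdot Z}\sigma_Z^2\sigma_{\epsilon_W}^2}{\beta_{XZ\cdot W}^2\sigma_Z^2\sigma_{\epsilon_W}^2 + \beta_{WZ}^2\sigma_Z^2\sigma_{\epsilon_X}^2 + \sigma_{\epsilon_W}^2\sigma_{\epsilon_X}^2}$ .

C. Equations Based on Grouped Observations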

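The equations in Sections II.A and II.B are applicable to the
population of ungrouped observations. There is a parallel set of
equations for the population of grouped observations.

The initial model for the regression of $\bar{Y}$ on $\bar{X}$ and $\bar{W}$ can be
written

[5.16]   $\bar{Y} = \bar{\alpha} + \beta_{\bar{Y}\bar{X}\cdot\bar{W}}\bar{X} + \beta_{\bar{Y}\bar{W}\cdot\bar{X}}\bar{W} + \bar{u}$ .

In [5.16], each term is the grouped counterpart of a term in equation
[5.10]; $\beta_{\bar{Y}\bar{X}\cdot\bar{W}}$ and $\beta_{\bar{Y}\bar{W}\cdot\bar{X}}$ are the regression coefficients for the
grouped observations. Under certain conditions to be discussed below,
$\beta_{\bar{Y}\bar{X}\cdot\bar{W}} = \beta_{YX\cdot W}$ and $\beta_{\bar{Y}\bar{W}\cdot\bar{X}} = \beta_{YW\cdot X}$.

The simultaneous equations pertinent to grouped data are given by

[5.17a]  $\bar{Y} = \alpha_Y + \beta_{YX\cdot WZ}\bar{X} + \beta_{YW\cdot XZ}\bar{W} + \beta_{YZ\cdot XW}\bar{Z} + \bar{\epsilon}_Y$ ,
[5.17b]  $\bar{X} = \alpha_X + \beta_{XW\cdot Z}\bar{W} + \beta_{XZ\cdot W}\bar{Z} + \bar{\epsilon}_X$ ,
[5.17c]  $\bar{W} = \alpha_W + \beta_{WZ}\bar{Z} + \bar{\epsilon}_W$ .

The regression coefficients are given by

[5.18a]  $\beta_{\bar{Y}\bar{X}\cdot\bar{W}} = \beta_{YX\cdot WZ} + \beta_{YZ\cdot XW}\beta_{XZ\cdot W}\dfrac{\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_W}^2}{\beta_{XZ\cdot W}^2\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_W}^2 + \beta_{WZ}^2\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_X}^2 + \sigma_{\bar{\epsilon}_W}^2\sigma_{\bar{\epsilon}_X}^2}$

and

[5.18b]  $\beta_{\bar{Y}\bar{W}\cdot\bar{X}} = \beta_{YW\cdot XZ} + \beta_{YZ\cdot XW}\dfrac{\beta_{WZ}\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_X}^2 - \beta_{XZ\cdot W}\beta_{XW\cdot Z}\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_W}^2}{\beta_{XZ\cdot W}^2\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_W}^2 + \beta_{WZ}^2\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_X}^2 + \sigma_{\bar{\epsilon}_W}^2\sigma_{\bar{\epsilon}_X}^2}$ .

Thus the only difference between equations of the grouped and ungrouped
regression coefficients ([5.18a, b] compared with [5.15a, b]) is the
replacement of population variances by between-group variances.

D. Bias Formulas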

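Let $b_{YX\cdot W}$ and $b_{YW\cdot X}$ be least-squares estimators of $\beta_{YX\cdot W}$ and
$\beta_{YW\cdot X}$, respectively. Also, let $B_{\bar{Y}\bar{X}\cdot\bar{W}}$ and $B_{\bar{Y}\bar{W}\cdot\bar{X}}$ be least-squares
estimators of $\beta_{\bar{Y}\bar{X}\cdot\bar{W}}$ and $\beta_{\bar{Y}\bar{W}\cdot\bar{X}}$. Under assumptions B1-B4,

         $E(b_{YX\cdot W}) = \beta_{YX\cdot W}$ ,   $E(b_{YW\cdot X}) = \beta_{YW\cdot X}$

and

         $E(B_{\bar{Y}\bar{X}\cdot\bar{W}}) = \beta_{\bar{Y}\bar{X}\cdot\bar{W}}$ ,   $E(B_{\bar{Y}\bar{W}\cdot\bar{X}}) = \beta_{\bar{Y}\bar{W}\cdot\bar{X}}$ .

That is, all four estimators are unbiased for their own coefficients.
But since the investigator is interested in relations at the individual
level, his estimates based on grouped data are biased unless
$\beta_{\bar{Y}\bar{X}\cdot\bar{W}} = \beta_{YX\cdot W}$ and $\beta_{\bar{Y}\bar{W}\cdot\bar{X}} = \beta_{YW\cdot X}$. We add a subscript to $\theta$ to
indicate the regressor under consideration; that is, $\theta_W$ will denote
the bias in estimating $\beta_{YW\cdot X}$ from grouped data and $\theta_X$ will denote
the bias from estimating $\beta_{YX\cdot W}$. From equations [5.15a, b] and
[5.18a, b], we get

[5.19a]  $\theta_X = E(B_{\bar{Y}\bar{X}\cdot\bar{W}}) - E(b_{YX\cdot W}) = \beta_{YZ\cdot XW}\beta_{XZ\cdot W}\left[\dfrac{\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_W}^2}{\bar{D}} - \dfrac{\sigma_Z^2\sigma_{\epsilon_W}^2}{D}\right]$

and

[5.19b]  $\theta_W = E(B_{\bar{Y}\bar{W}\cdot\bar{X}}) - E(b_{YW\cdot X}) = \beta_{YZ\cdot XW}\left[\dfrac{\beta_{WZ}\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_X}^2 - \beta_{XZ\cdot W}\beta_{XW\cdot Z}\sigma_{\bar{Z}}^2\sigma_{\bar{\epsilon}_W}^2}{\bar{D}} - \dfrac{\beta_{WZ}\sigma_Z^2\sigma_{\epsilon_X}^2 - \beta_{XZ\cdot W}\beta_{XW\cdot Z}\sigma_Z^2\sigma_{\epsilon_W}^2}{D}\right]$ ,

where $D = \beta_{XZ\cdot W}^2\sigma_Z^2\sigma_{\epsilon_W}^2 + \beta_{WZ}^2\sigma_Z^2\sigma_{\epsilon_X}^2 + \sigma_{\epsilon_W}^2\sigma_{\epsilon_X}^2$ and $\bar{D}$ is the same
expression with the between-group variances in place of the population
variances.

These bias formulas are complicated, especially for the prior regressor
W. However, it is clear that there will be no bias so long as the
grouping variable has no direct relation to the dependent variable
(that is, so long as $\beta_{YZ\cdot XW} = 0$).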
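The dependence of the two-regressor coefficients on the structural parameters can be checked numerically. The sketch below is only an illustration: it computes $\beta_{YX\cdot W}$ and $\beta_{YW\cdot X}$ once from closed-form expressions in the structural parameters and once from the covariance matrix implied by the recursive structure (Z prior to W, W prior to X, X prior to Y), and then evaluates the grouping bias as the difference between the grouped and ungrouped values. All numerical values, including the between-group variances, and all function names are hypothetical.

```python
import numpy as np

def partial_slopes_from_structure(a, b, c, d, e, f, var_z, var_ew, var_ex):
    """Return (beta_YX.W, beta_YW.X) implied by the structural parameters.

    a = beta_YX.WZ, b = beta_YW.XZ, c = beta_YZ.XW,
    d = beta_XW.Z,  e = beta_XZ.W,  f = beta_WZ.
    """
    denom = e**2 * var_z * var_ew + var_ex * (f**2 * var_z + var_ew)
    beta_yx_w = a + c * e * var_z * var_ew / denom
    beta_yw_x = b + c * var_z * (f * var_ex - e * d * var_ew) / denom
    return beta_yx_w, beta_yw_x

def partial_slopes_from_covariances(a, b, c, d, e, f, var_z, var_ew, var_ex):
    """Same quantities obtained directly from the implied covariance matrix."""
    # Loadings of W, X, Y on the independent sources (Z, e_W, e_X);
    # e_Y is independent of X and W, so it drops out of Cov(Y, X) and Cov(Y, W).
    w = np.array([f, 1.0, 0.0])
    x = d * w + np.array([e, 0.0, 1.0])
    y = a * x + b * w + np.array([c, 0.0, 0.0])
    S = np.diag([var_z, var_ew, var_ex])
    R = np.array([x, w])                  # regressors in the order X, W
    return tuple(np.linalg.solve(R @ S @ R.T, R @ S @ y))

def grouping_bias(coefs, individual_vars, between_vars):
    """Bias (theta_X, theta_W): grouped minus ungrouped coefficients."""
    grouped = partial_slopes_from_structure(*coefs, *between_vars)
    ungrouped = partial_slopes_from_structure(*coefs, *individual_vars)
    return grouped[0] - ungrouped[0], grouped[1] - ungrouped[1]

coefs = (0.5, 0.3, 0.4, 0.6, 0.7, 0.2)   # hypothetical a, b, c, d, e, f
individual = (1.0, 1.0, 1.0)             # var(Z), var(e_W), var(e_X)
between = (1.0, 0.1, 0.1)                # hypothetical between-group variances
print(partial_slopes_from_structure(*coefs, *individual))
print(partial_slopes_from_covariances(*coefs, *individual))
print(grouping_bias(coefs, individual, between))
```

Setting the third coefficient (the direct relation of Z to Y) to zero makes both biases vanish whatever the between-group variances are.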

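III. The Taxonomy for Two Regressors

A taxonomy can be generated by setting various combinations of
$\beta_{YZ\cdot XW}$, $\beta_{XZ\cdot W}$, and $\beta_{WZ}$ to zero. This generates 2 x 2 x 2 = 8
categories of grouping variables:

(1) Z directly related to Y, X, and W
    ($\beta_{YZ\cdot XW} \neq 0$, $\beta_{XZ\cdot W} \neq 0$, $\beta_{WZ} \neq 0$).

(2) Z directly related to Y and X, but not to W
    ($\beta_{YZ\cdot XW} \neq 0$, $\beta_{XZ\cdot W} \neq 0$, $\beta_{WZ} = 0$).

(3) Z directly related to Y and W, but not to X
    ($\beta_{YZ\cdot XW} \neq 0$, $\beta_{XZ\cdot W} = 0$, $\beta_{WZ} \neq 0$).

(4) Z directly related to Y, but not to X or W
    ($\beta_{YZ\cdot XW} \neq 0$, $\beta_{XZ\cdot W} = 0$, $\beta_{WZ} = 0$).

(5) Z directly related to X and W, but not to Y
    ($\beta_{YZ\cdot XW} = 0$, $\beta_{XZ\cdot W} \neq 0$, $\beta_{WZ} \neq 0$).

(6) Z directly related to X, but not to Y or W
    ($\beta_{YZ\cdot XW} = 0$, $\beta_{XZ\cdot W} \neq 0$, $\beta_{WZ} = 0$).

(7) Z directly related to W, but not to Y or X
    ($\beta_{YZ\cdot XW} = 0$, $\beta_{XZ\cdot W} = 0$, $\beta_{WZ} \neq 0$).

(8) Z not linearly related to Y, X, or W
    ($\beta_{YZ\cdot XW} = 0$, $\beta_{XZ\cdot W} = 0$, $\beta_{WZ} = 0$).

As the relation of W to X can also affect bias under certain conditions,
it is useful to subdivide each category on the basis of whether
$\beta_{XW\cdot Z}$ is non-zero or not. Figure 5.1 presents the sixteen path
diagrams.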
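The sixteen subcategories follow mechanically from switching each of the three relations of Z on or off and then splitting on the relation of W to X. The short sketch below enumerates them in the order of the list above; the parameter labels and the function name `taxonomy` are names chosen for this illustration.

```python
from itertools import product

PARAMS = ("beta_YZ.XW", "beta_XZ.W", "beta_WZ")

def taxonomy():
    """List the 16 subcategories as (category, {parameter: nonzero?}, beta_XW.Z nonzero?)."""
    rows = []
    # Iterating with True first reproduces the order of categories (1) to (8).
    for number, flags in enumerate(product((True, False), repeat=3), start=1):
        for xw_nonzero in (False, True):
            rows.append((number, dict(zip(PARAMS, flags)), xw_nonzero))
    return rows

for number, flags, xw_nonzero in taxonomy():
    related = [name for name, nonzero in flags.items() if nonzero]
    print(number, related, "beta_XW.Z != 0" if xw_nonzero else "beta_XW.Z = 0")
```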

Table 5.2 summarizes the results for bias in the two-regressor
case. Grouping by a variable from five of the sixteen subcategories
biases the estimate of at least one regression coefficient. There are
obvious parallels with the single-regressor case. When there is no
direct relation of Z to Y, estimates are unbiased. However, when
the grouping variable is directly related to both the dependent variable
and a regressor (Categories 2 and 3), the estimate of the coefficient
for the regressor is biased when the regressors are correlated. This
is analogous to Category I grouping in the bivariate case and the
results are the same.

The only result that is not analogous to the bivariate case occurs
when $\beta_{XW\cdot Z} \neq 0$ and we estimate the coefficient of the prior regressor.
Under this condition, biased estimates $B_{\bar{Y}\bar{W}\cdot\bar{X}}$ can result when Z
is directly related to Y and X even though $\beta_{WZ} = 0$.

Figure 5.1. Path diagrams for the subcategories of the taxonomy in
the two-regressor case.
Table 5.2. Presence of bias from grouping as a function of taxonomic
subcategory in the two-regressor case.

                     Values of Parameters
Category   $\beta_{YZ\cdot XW}$   $\beta_{XZ\cdot W}$   $\beta_{WZ}$   $\beta_{XW\cdot Z}$     Bias in Coefficients
   1          ≠0         ≠0        ≠0        0
   1          ≠0         ≠0        ≠0       ≠0
   2          ≠0         ≠0         0        0
   2          ≠0         ≠0         0       ≠0
   3          ≠0          0        ≠0        0
   3          ≠0          0        ≠0       ≠0
   4          ≠0          0         0        0
   4          ≠0          0         0       ≠0
   5           0         ≠0        ≠0        0
   5           0         ≠0        ≠0       ≠0
   6           0         ≠0         0        0
   6           0         ≠0         0       ≠0
   7           0          0        ≠0        0
   7           0          0        ≠0       ≠0
   8           0          0         0        0
   8           0          0         0       ≠0

[Entries in the "Bias in Coefficients" columns are illegible in the source.]

* = Estimator of regression coefficient from grouped data is biased.

IV. Implications of Findings

Our taxonomic approach clarifies certain questions raised by
earlier investigations of multiple regression. We have shown that bias
can result for only a subset of regression coefficients. In fact, the
conditions under which the estimator of a particular coefficient will
be biased can now be specified.

Much has been left unsaid about the practical consequences of
grouping in the multiple-regressor case. Bias in estimating at least
some coefficients is highly likely unless groups are formed randomly.
With non-random grouping, the investigator may group on a variable
which is prior to all others. Otherwise, he introduces bias in
estimating some coefficients by grouping jointly on the dependent
variable and posterior regressors.

The "structural equations" approach does enable the investigator
to determine which estimators are biased, but the procedures quickly
become cumbersome as more independent variables are included. More
work is needed to determine the utility of this approach, especially
when compared with the procedures developed by Feige and Watts.

CHAPTER 6

EMPIRICAL EXAMPLES IN THE SINGLE-REGRESSOR CASE

So far we have considered ways of predicting how various grouping
procedures affect the estimation of simple linear regression coefficients.
It seems appropriate at this point to demonstrate how well our
predictions conform to empirical results. Information collected on
incoming freshmen at a large Midwestern university serves as the data
base for this investigation. Of 300 measures of abilities, attitudes,
and interests collected originally, approximately 20 will be used.
Persons with missing information on any of these variables are dropped
from consideration.

First we describe the relevant variables and the form in which
they enter the analysis. Next a simple linear regression model is
hypothesized, and the regression slope and its standard error are
estimated from the ungrouped observations.

The data are then grouped. We vary the relation of the grouping
variable to the dependent and independent variables, the number of
groups formed, and the distribution of observations among the groups.
Estimates of the regression slope and its standard error are then
calculated from the grouped observations for each grouping variable.
The difference between the observed grouped and ungrouped slope estimates
(bias) is then compared with that predicted from the formulas of
Chapter 3. Indices of efficiency are also presented. We then discuss
the potential utility of composite estimates, formed from the estimates
generated by different grouping characteristics, in making inferences
about the individual-level relations.

I. Description of Data

All incoming freshmen at a large Midwestern university were
administered an achievement battery consisting of arithmetic, mathematics,
and reading comprehension subtests during their orientation session
prior to entering the university. On the last day of orientation, each
student was asked to complete inventories assessing his personal history,
his interests, his expectations regarding his university experience, and
his opinions about selected social and academic issues. In our example,
this information was later combined with data from admissions applications
and with scores from the Scholastic Aptitude Test (SAT).

A. Identification of Variables

We focus on the relation of achievement (X) to self-appraisal (Y) of
academic abilities and of SAT (X) to achievement (Y). Each student's total
score on the achievement test battery (ACH) represents his achievement
level. The indicator (SRAA) of self-rated academic abilities is a
weighted composite of responses to ten questions (Table 6.1) asking the
student to rate his abilities or his work in different academic areas.

The weights of variables entering the composite self-rating were
determined by variable loadings on the first factor from a principal
components analysis. The weights were relatively uniform except that
mathematics ability and scientific ability had small weights. Thus the
analysis leads us to equate students' perceptions of their academic
ability mainly with their verbal communication skills.

Table 6.1. Questions included in composite self-appraisal of academic
abilities (SRAA).

Use the instructions below for answering questions 1 through 4:

"Rate yourself on each of the following traits as you really think
you are when compared with the average student of your own age."

    Scale:  A. Lowest 10%
            B. Below average
            C. Average
            D. Above average
            E. Highest 10%

 1. Academic ability
 2. Mathematical ability
 3. Self confidence (intellectual)
 4. Writing ability

Use the instructions below for answering questions 5 through 8:

"Rate yourself on how competent you feel you are when compared to
other freshmen at the university."

    Scale:  A. Lowest 10%
            B. Below average
            C. Average
            D. Above average
            E. Highest 10%

 5. Overall scholarship
 6. Scientific ability
 7. Reading skills
 8. Intellectual self-confidence

 9. Where do you think you are likely to rank with respect to grades
    in your freshman class while in college?

    Scale:  A. Among the highest 10%
            B. Above average
            C. About average
            D. Below average
            E. Among the lowest 10%

10. Forget for a moment how others grade your work. In general, what
    is your own opinion as to how good your academic work will be?

    Scale:  A. Excellent
            B. Very good
            C. About average
            D. Somewhat below average
            E. Much below average

Subgroups of students were formed on the basis of their SAT, ACH, and
SRAA scores. Students were classified into subgroups according to the
highest two digits of their ACH scores (ACH2, 10 groups: 31-39, 40-49,
..., 110-119, 120), of their SAT scores (SAT2, 13 groups: 400-499,
500-599, ..., 1500-1599, 1600), and of their SRAA scores (SRAA2, 5 groups:
-2.99 to -2.00, -1.99 to -1.00, -0.99 to 0.99, 1.00 to 1.99, 2.00 to 2.99).
ACH2, SAT2, and SRAA2, then, were the grouping variables based on ACH, SAT,
and SRAA, respectively.
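The digit-based grouping rules for ACH2, SAT2, and SRAA2 amount to simple truncation. The sketch below is one way to express them, assuming integer ACH and SAT scores and SRAA scores within the stated range; the function names are hypothetical.

```python
import math

def ach2(score):
    """Leading digits of an ACH score: 31-39 -> 3, ..., 110-119 -> 11, 120 -> 12."""
    return score // 10

def sat2(score):
    """Leading digits of an SAT total: 400-499 -> 4, ..., 1600 -> 16."""
    return score // 100

def sraa2(score):
    """Sign and leading digit of an SRAA score; -0.99 to 0.99 falls in group 0."""
    return math.trunc(score)

print(ach2(84), sat2(1068), sraa2(-1.37))
```

Truncating toward zero, rather than rounding down, is what makes the middle SRAA2 group span -0.99 to 0.99.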

The remaining grouping variables were selected according to the
following criteria:

1. The variable has appeared frequently in studies of the relations
   among academic self-appraisal, achievement, and aptitude (e.g.,
   parental income, parental education, parents' educational
   aspirations for their children).

2. Alternatively, the frequency distribution of the variable and
   its pattern of zero-order correlations with ACH, SAT, and SRAA
   suggested that it would be a suitable representative of a
   particular taxonomic category (e.g., number of semesters of
   high-school physical sciences, student opinion about whether
   college is worth the effort, and the last 2 digits of the
   student's identification number).

Table 6.2 lists the grouping variables, ordered according to number
of groups formed (except for the two "identification" variables at the
top which serve as random numbers in our example).

B. Distributional and Relational Properties of the Variables

Table 6.3 lists for each study variable the mean, standard deviation,
and skewness coefficient and zero-order correlations with SRAA, ACH, and
SAT. Only the 2,676 students with complete information on all variables
are used here and later.



After the bulk of the analyses was completed, it was discovered that
there were missing observations on the grouping characteristics
CLIMP, COLEFF, and QCJOB. In addition, certain modifications were
made in the response categories of ANTDEG. In its original form,
ANTDEG formed nine groups. In the results reported here, however,
students responding "Other (9)" were dropped, and students anticipating
any professional degree beyond the masters level (responses 6, 7,
and 8) were collapsed into a single group numbered "5". The sizes of
the subsamples defined by the acceptable responses to CLIMP, COLEFF,
QCJOB, and the modified ANTDEG were 2,632, 2,669, 2,637, and 2,646,

Table 6.2. Information on grouping variables.

Variable         Number of Groups
Identification   After Aggregation   Description

ID2                    100           Last 2 digits of student identification
ID1                     10           Last digit of student identification
HSGPA2                  23           High school's report of student's grade
                                     point average on a 4-point scale
                                     (highest 2 digits)
SAT2                    13           Highest 2 digits of Total score from the
                                     Scholastic Aptitude Test
ACH2                    10           Highest 2 digits of Total score from the
                                     Achievement Battery
PARINC                  10           Student's best estimate of 1970 parental
                                     income before taxes
REPGPA                   7           Student's report of average grade in
                                     secondary school
POPED                    6           Student's report of highest level of
                                     formal education obtained by his father
ANTDEG                   5           Student's anticipated highest academic
                                     degree
HSMATH                   5           Student's report of number of semesters
                                     of high school mathematics
HSPHYS                   5           Student's report of number of semesters
                                     of high school physical sciences
NOBOOK                   5           Student's report of number of books in
                                     the home
PARASP                   5           "What is the highest level of education
                                     that your parents hope you will complete?"
SRAA2                    5           Highest digit and sign of composite
                                     academic self-opinion
CLIMP                    4           "My grades are markedly better in courses
                                     that I see I will need later."
COLEFF                   4           "I often wonder if four years of college
                                     will really be worth the effort."
QCJOB                    4           "I often wish that I were offered a good
                                     job now so I wouldn't have to spend four
                                     years in college."

Table 6.3. Means, standard deviations, and skewness coefficients of
study variables, and the zero-order correlations of each
variable with SRAA, ACH, and SAT.



Variable                Standard                    Correlation with
Name          Mean      Deviation   Skewness     SRAA      ACH      SAT

SRAA          0.008       1.006       .223      1.000     .529     .574
ACH          84.766      15.463      -.364       .529    1.000     .839
SAT        1068.846     177.209       .068       .574     .839    1.000
ID2          49.561      29.126       .003       .019     .020     .008
ID1           4.453       2.865       .011      -.033    -.042    -.047
HSGPA2        3.157        .469      -.067       .370     .535     .488
SAT2         10.235       1.798       .064       .566     .827     .987
ACH2          8.024       1.572      -.333       .522     .983     .827
PARINC        6.308       2.289      -.234       .064     .070     .076
REPGPA        3.203       1.284       .232      -.455    -.490    -.469
POPED         3.987       1.418      -.321       .145     .139     .157
ANTDEG        3.867        .959       .687       .264     .156     .140
HSMATH        4.332        .879      -.260       .202     .479     .346
HSPHYS        2.623        .977       .319       .209     .318     .257
NOBOOK        4.104        .978      -.769       .196     .146     .203
PARASP        4.458        .626     -1.523       .172     .066     .087
SRAA2          .005        .689       .399       .885     .476     .520
CLIMP         2.201        .821       .304       .074     .147     .165
COLEFF        2.695        .951      -.209       .189     .134     .114
QCJOB         3.330        .821     -1.151       .199     .105     .118

The variables based on the student's identification number (ID2 and
ID1) have rectangular distributions as expected. Their intercorrelations
with the main variables are close to zero. They satisfactorily represent
Category IV ("random") grouping.

In this sample, parental income (PARINC) is weakly related to
achievement, aptitude, and academic self-ratings, with correlations not
much larger than those from the essentially random ID variables (POPED
and NOBOOK).

Anticipated highest degree (ANTDEG) and parental aspirations (PARASP)
correlate moderately with each other (.39), but do not correlate with
other grouping variables. Both correlate higher with SRAA than with ACH
and SAT, perhaps because of similar biases or sets in all student
self-report measures.

The grouping variables generated from ACH, SRAA, and SAT (ACH2,
SRAA2, and SAT2) and the indicators of high school grades (HSGPA2 and
REPGPA) have substantial correlations with the main variables (ACH, SRAA,
and SAT). In general these correlations follow predictable patterns.
ACH2 correlates highest with ACH, next highest with SAT. SAT2 correlates
highest with SAT, next highest with ACH. SRAA2 correlates highest with
SRAA, and the order of its correlations with SAT and ACH is the same as
for SRAA. HSGPA2 has stronger correlations with the two total test
scores than with academic self-rating. The profile of correlations for
REPGPA is flatter than that from HSGPA2, but it maintains the same order
of magnitude.

respectively. An examination of the means, standard deviations, and
intercorrelations of SRAA, ACH, and SAT for these subsamples did not
indicate any consistent and important deviations from the estimates
based on the entire 2,676 observations.

The predominance of four- and five-choice variables has the advantage
of easy convertibility to group classifications and the disadvantage
of low reliability. The substantive importance of these short scales
lies in the diversity of their patterns of correlation with achievement,
aptitude, and academic self-rating. As will be shown subsequently,
reasonably precise estimates of the relations at the individual level
can be obtained by grouping on some of these variables, while grouping
by others yields wildly misleading estimates. Determining which
characteristics coincide with high precision in empirical data is
particularly important at this point in the study of grouping effects.

C. Review of Factors Affecting Within-Category Precision

The mechanisms controlling the comparative precision of estimates
from different grouping characteristics within a given category vary
according to category. The four key "forces" determining precision
within a taxonomic category are (1) the relative strengths of the
relations of the grouping variable to the dependent and independent
variables, (2) the coarseness of the grouping, (3) the between-groups
variation in the independent variable for a given grouping characteristic,
and (4) the distribution of the individual observations among the groups.

We review briefly the manner in which these forces operate, according to
the theory developed earlier.

1. Strengths of Relations of Z to X and Y

The standardized regression coefficients best indicate the
strength of relations within a given sample. An asterisk is introduced
as a superscript for regression coefficients to denote that the
coefficients are standardized in this section. In Category III,
according to our theory, variables with the weakest direct relation
to Y (small $\beta^*_{YZ\cdot X}$) and the strongest relation to X (large $\beta^*_{XZ}$)
yield the most precise estimates.

There is no exact formula for the "precision" of estimation. Precise
estimates generally combine small bias (in our case, $B_{\bar{Y}\bar{X}} - \beta_{YX}$) with
small mean-squared error [MSE = (bias)$^2$ + SE($B_{\bar{Y}\bar{X}}$)$^2$]. Whether bias
or mean-squared error is more important in defining precision depends
on the purpose for which the estimate will be used.

The influence is more complicated in Category I. In general,
large $\beta^*_{XZ}$ and small $\beta^*_{YZ\cdot X}$ imply greater precision. More can be
said if we fix one parameter and vary the other, or consider their ratio:

(a) For fixed $\beta^*_{XZ}$ of any size, the smaller the value of $\beta^*_{YZ\cdot X}$,
    the smaller the bias.
(b) For small (less than .2 but significantly different from zero)
    values of $\beta^*_{YZ\cdot X}$, larger values of $\beta^*_{XZ}$ lead to smaller bias.
(c) Whenever $\beta^*_{YZ\cdot X}$ exceeds $\beta^*_{XZ}$ for a Category I variable, a
    particularly poor estimate of $\beta_{YX}$ results from grouped
    observations.
2. Coarseness

The coarseness of grouping, by which we mean the number of groups
formed (m) from a fixed number of observations (N), largely determines
the efficiency with a Category IV grouping characteristic.
The strength of relation of Category IV variables to the main variables
is inconsequential; hence, they group observations in an essentially
random fashion, and the precision of their estimates is influenced
only by m.

Coarseness influences bias and efficiency in other categories to
a lesser degree. If two variables $Z_1$ and $Z_2$ have similar relations
to X and Y ($\beta^*_{XZ_1} = \beta^*_{XZ_2}$ and $\beta^*_{YZ_1\cdot X} = \beta^*_{YZ_2\cdot X}$), the one forming more
groups is likely to yield estimates with smaller bias and higher
efficiency.

3. Between-Groups Variation in X

Large between-groups variance in the independent variable implies
small bias and high efficiency. With fixed values of m and relatively
constant values of $\beta^*_{YZ\cdot X}$ and $\beta^*_{XZ}$, the
grouping variable which maximizes the between-groups variance of X
yields the most precise estimate.

4. Distribution of Individual Observations Among the Groups

The mean of the dependent or independent variables in a group
with few observations is unstable. Such means can have a
disproportionate impact on the estimates from grouped observations.
Unpredictable variation of a few group means when m is small is
potentially more damaging than the same variation among a large number
of observations at the individual level. At the group level, the only
observations are the means. Instability in any cell mean has a
greater impact on the precision of the parameter estimates than does
instability at the individual level. When the observations are not
evenly distributed among the groups, precision can be affected.

The four forces do not act independently. It makes little sense
to consider the impact of $\beta_{YZ\cdot X}$ and ignore the size of
$\beta_{XZ}$, or to concentrate on coarseness without considering
variation in group size. Thus the investigator must keep in mind that
the forces can interact. In the discussion of the empirical data, we
will only reluctantly attribute a loss of precision to a single source.
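
The contrast between random grouping and grouping on a variable related to X can be illustrated with a small simulation. The sketch below is not taken from the study: the sample size, the seed, and the helper name `between_group_variance` are assumptions chosen only for illustration. It compares the between-groups variance of a standardized X under ten random groups (Category IV-like) and under ten groups formed from the ranks of X itself.

```python
import numpy as np


def between_group_variance(x, groups):
    """Size-weighted variance of the group means of x around the grand mean."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    grand_mean = x.mean()
    total = 0.0
    for g in np.unique(groups):
        members = x[groups == g]
        total += members.size * (members.mean() - grand_mean) ** 2
    return total / x.size


rng = np.random.default_rng(0)   # hypothetical seed
n = 2000                         # hypothetical sample size
x = rng.standard_normal(n)
x = (x - x.mean()) / x.std()     # standardize, as in the text

# Ten groups assigned at random (no relation to X).
random_groups = rng.integers(0, 10, size=n)

# Ten equal-sized groups formed from the rank order of X.
ranks = x.argsort().argsort()
decile_groups = ranks * 10 // n

var_random = between_group_variance(x, random_groups)
var_decile = between_group_variance(x, decile_groups)
print(f"random grouping: {var_random:.4f}")
print(f"grouping on X:   {var_decile:.4f}")
```

Under these assumptions the random grouping retains only a small fraction of the variance of X among the group means, while grouping on X retains most of it, which is the sense in which the third force above favors grouping variables strongly related to X.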

II. Regression of Academic Self-Appraisal on Achievement

As our first example, we regress academic self-appraisal (SRAA = Y)
on achievement (ACH = X). Alternative models of the relation between ACH
and SRAA are certainly reasonable. However, we only wish to illustrate
the effects of grouping, and the chosen ordering is informative.

At the outset we standardize all variables. The procedures for
generating group estimates and judging their precision are invariant with
regard to linear transformations of the variables. Once the observations
are standardized, the regression coefficient at the individual level (the
standardized regression coefficient) is an unbiased estimator of the
correlation coefficient; i.e., $E(b^*_{YX}) = \beta^*_{YX} = \rho_{YX}$
in the single-regressor case. Thus we obtain estimates of $\rho_{YX}$
when we regress Y on X. Under these circumstances, comparisons of
$B_{\bar{Y}\bar{X}}$ with $b_{YX}$ are checks on the bias in estimating
the individual-level correlation coefficient from grouped data. (At this
point, we will drop the * denoting standardized coefficients since all
coefficients in the remainder of the chapter will be generated from data
that were initially standardized.)
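
The statement that the standardized slope equals the correlation in the single-regressor case can be checked numerically. The following sketch uses made-up data; the variable names, scales, and seed are assumptions for illustration and are not values from the study.

```python
import numpy as np

rng = np.random.default_rng(1)                        # hypothetical seed
ach = rng.normal(50.0, 10.0, size=500)                # hypothetical raw X scores
sraa = 0.05 * ach + rng.normal(0.0, 1.0, size=500)    # hypothetical raw Y scores


def standardize(v):
    """Rescale to mean 0 and standard deviation 1."""
    return (v - v.mean()) / v.std()


zx, zy = standardize(ach), standardize(sraa)

# Least-squares slope of standardized Y on standardized X (no intercept needed).
slope = (zx * zy).sum() / (zx ** 2).sum()

# Ordinary product-moment correlation of the raw scores.
r = np.corrcoef(ach, sraa)[0, 1]
print(slope, r)
```

Because the correlation is unchanged by linear rescaling of either variable, the same value results whatever units the raw scores are recorded in.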
A. Regression Coefficients from Ungrouped Data

According to the analysis of the 2,676 observations, the equation
relating SRAA (Y) to ACH (X) is

SRAA = .529 (ACH).

That is, the slope of the regression is $b_{YX} = .529$. (The intercept
is essentially 0 since all variables were standardized.) Also,

$SE(b_{YX}) = .0032$

and

$R^2_{YX} = .281$ (the squared multiple correlation coefficient).

In a study such as this, the investigator usually generalizes beyond
the 2,676 students included in the analysis. After all, these students
are not even the entire freshman class entering this university during
the 1971-72 academic year.^ Apparently, our deletion of subjects did
leave a representative sample of the freshman class.^
B. Categorization of Grouping Variables

To classify grouping variables (Z) into taxonomic categories
requires information beyond that in Table 6.3. Table 6.4 contains, for
each Z, estimates of the regression coefficients ($\beta_{YX\cdot Z}$,
$\beta_{YZ\cdot X}$, $\beta_{XZ}$, $\beta_{YZ}$) and their standard
errors (in parentheses). An estimate of the between-groups standard
deviation, $\sigma_{\bar{X}}$, of ACH for each of the grouping
variables is also given.

The taxonomy introduced in Chapter 3 categorizes on the basis of the
magnitude of $\beta_{YZ\cdot X}$ and $\beta_{XZ}$. Operationally, for
initial categorization, we require that $\beta_{YZ\cdot X}$ and
$\beta_{XZ}$ exceed three times their standard errors to be considered
significantly different from zero. This rather stringent criterion
leads to the following category assignments. [Variables within
categories are ordered by the number of groups they form (m).]



^A total of 5,230 students completed the questionnaires during
orientation of the 1971-72 academic year. Other students enrolled
without attending orientation or participating in the orientation tests
and survey. Students who did not begin Fall term were also excluded from
the 5,230 total.

^An early computer run (carried out before SRAA was created and
before the subtests composing the achievement battery were combined
to obtain the total achievement score) based on the 4,241 freshmen
with reported SAT scores indicated that our students are like their
fellow classmates. The average student in our sample performed
slightly better on the SAT (1059 to 1054), about the same on the
achievement battery (85 to 84), had the same high school grade
average, and reported a slightly higher parental income. The
relationship between SAT and PARINC was somewhat stronger (0.109
compared to 0.076) for the 3,647 students with SAT scores who also
reported their parents' income than for the students in our sample.
Differences in means, standard deviations, and intercorrelations on
other characteristics were minor also.

Table 6.4. Estimates of parameters relating ACH (X) and SRAA (Y) to
possible grouping variables (Z).^a

Entries marked [illegible] could not be read in the source copy.

                                      Parameter Estimates^b
Variable  Group
Name      Size (m)  $\beta_{YX\cdot Z}$    $\beta_{YZ\cdot X}$    $\beta_{XZ}$     $\beta_{YZ}$     $\sigma_{\bar{X}}$

ID2       100       [illegible] (.0164)    [illegible] (.0164)     .020 (.0193)     .019 (.0193)    .189
ID1        10       [illegible] (.0164)    [illegible] (.0164)    -.042 (.0193)    -.033 (.0193)    .078
HSGPA2     23       [illegible] (.0193)    [illegible] (.0193)     .535 (.0163)     .370 (.0180)    .552
SAT2       13       [illegible] (.0282)    [illegible] (.0282)     .827 (.0109)     .566 (.0160)    .831
ACH2       10       [illegible] (.0896)    [illegible] (.0896)     .983 (.0035)     .522 (.0165)    .984
PARINC     10       [illegible] (.0164)    [illegible] (.0164)     .070 (.0193)     .064 (.0193)    .122
REPGPA      7       [illegible] (.0182)    [illegible] (.0182)    -.490 (.0169)    -.455 (.0172)    .510
POPED       6       [illegible] (.0165)    [illegible] (.0165)     .139 (.0192)     .145 (.0191)    .150
ANTDEG      5       [illegible] (.0162)    [illegible] (.0162)     .156 (.0191)     .264 (.0186)    .159
HSMATH      5       [illegible] (.0187)    [illegible] (.0187)     .479 (.0170)     .202 (.0189)    .489
HSPHYS      5       [illegible] (.0173)    [illegible] (.0173)     .318 (.0183)     .209 (.0189)    .365
NOBOOK      5       [illegible] (.0164)     .122 (.0164)           .146 (.0191)     .196 (.0190)    .148
PARASP      5       [illegible] (.0162)    [illegible] (.0162)     .066 (.0193)     .172 (.0190)    .077
SRAA2       5        .139 (.0099)           .819 (.0099)           .476 (.0170)     .885 (.0090)    .481
CLIMP       4        .530 (.0166)          -.003 (.0166)           .147 (.0191)     .074 (.0193)    .163
COLEFF      4        .513 (.0164)           .121 (.0164)           .134 (.0192)     .189 (.0190)    .144
QCJOB       4        .514 (.0163)           .145 (.0163)           .105 (.0192)     .199 (.0190)    .113

^a All variables have been standardized prior to grouping so that
$\sigma_Y = \sigma_X = \sigma_Z = 1$, $\beta_{XZ} = \rho_{XZ}$, and
$\beta_{YZ} = \rho_{YZ}$.

^b Numbers in parentheses are standard errors of the regression
coefficients.

Category assignments:

$|\beta_{XZ}| \ge 3\,SE(\beta_{XZ})$:
    Category I:    HSGPA2, SAT2, ANTDEG, REPGPA, POPED, SRAA2,
                   HSMATH, NOBOOK, PARASP, COLEFF, QCJOB
    Category III:  ACH2, PARINC, HSPHYS, CLIMP

$|\beta_{XZ}| < 3\,SE(\beta_{XZ})$:
    Category II:   (NONE)
    Category IV:   ID2, ID1

As we mentioned previously, no characteristics belong to Category
II, and the number falling in Category I is large. SRAA2 and ACH2
are special cases within Categories I and III. These, respectively, are
the best approximations to what Blalock (1964) and Hannan (1970, 1971,
1972) have called "grouping on the dependent variable" and "grouping on
the independent variable".
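
The operational rule for category assignment can be written as a short function. This is a sketch under stated assumptions: the function name `classify` and its argument names are hypothetical, the three-standard-error threshold is the one given in the text, and the mapping of the two significance tests onto Categories I through IV is read from the layout of the category assignments above rather than from the Chapter 3 definitions, which are not reproduced here.

```python
def classify(beta_yz_x, se_yz_x, beta_xz, se_xz, k=3.0):
    """Assign a grouping variable Z to a taxonomic category.

    A coefficient counts as non-zero when its absolute value is at least
    k times its standard error (k = 3 in the text).
    """
    direct_relation_to_y = abs(beta_yz_x) >= k * se_yz_x
    relation_to_x = abs(beta_xz) >= k * se_xz
    if relation_to_x and direct_relation_to_y:
        return "I"
    if direct_relation_to_y:
        return "II"
    if relation_to_x:
        return "III"
    return "IV"


# Coefficients and standard errors as printed in Table 6.4.
print(classify(0.819, 0.0099, 0.476, 0.0170))   # SRAA2
print(classify(-0.003, 0.0166, 0.147, 0.0191))  # CLIMP
print(classify(0.145, 0.0163, 0.105, 0.0192))   # QCJOB
```

With the Table 6.4 values, SRAA2 and QCJOB fall in Category I and CLIMP in Category III, in agreement with the assignments listed above.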

C. Prediction of Bias from Grouping

A modification of the bias formulas ([3.19'], [3.28'], [3.29],
[3.31]) from pages 63 and 64 can be used to predict the bias from
grouping for our empirical examples. Remembering that
$\sigma_Y = \sigma_X = \sigma_Z = 1$, our equation for estimating the
bias from grouping on a particular Z is given by

$\theta = \beta_{YZ\cdot X}\,\beta_{XZ}\,\dfrac{1 - \sigma^2_{\bar{X}}}{\sigma^2_{\bar{X}}}$

This approximation is particularly good when the sample either equals
the population or is very large. The small-sample properties of
$\theta$ are less predictable when both $\beta_{YZ\cdot X}$ and
$\beta_{XZ}$ are non-zero.

We have included academic self-appraisal in our example because this
type of data is often collected anonymously. If so, we cannot correlate
ACH and SRAA at the individual level. The data collected in this study
were completely identified, and thus the results under constraints of
anonymity can be compared with the results from completely identified
data.

As pointed out in Chapter 1 [discussion of Problem (D)], one way
to handle anonymous data is to analyze relations at the group level.
For example, students can be asked to indicate their number of
semesters of high school mathematics (HSMATH) when they complete the
attitude questionnaires anonymously. SRAA and ACH scores can then
be grouped according to students' HSMATH responses, and the
regression of SRAA on ACH can be estimated from the weighted group
means of SRAA and ACH.
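
The grouped estimate described above, a regression on group means weighted by group size, can be sketched as follows. The function name `grouped_slope` and the toy data are assumptions for illustration; the study's own computing procedure is not described in this section.

```python
import numpy as np


def grouped_slope(x, y, z):
    """Slope of the regression of group means of y on group means of x,
    with groups defined by z and each group weighted by its size."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    _, inverse, counts = np.unique(np.asarray(z), return_inverse=True,
                                   return_counts=True)
    x_means = np.bincount(inverse, weights=x) / counts
    y_means = np.bincount(inverse, weights=y) / counts
    x_bar = np.average(x_means, weights=counts)
    y_bar = np.average(y_means, weights=counts)
    numerator = np.sum(counts * (x_means - x_bar) * (y_means - y_bar))
    denominator = np.sum(counts * (x_means - x_bar) ** 2)
    return numerator / denominator


# Toy data: three groups of two observations each (hypothetical values).
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
z = [0, 0, 1, 1, 2, 2]
print(grouped_slope(x, y, z))
```

In the toy data the within-group reversals of y cancel in the group means, so the grouped slope differs from the individual-level slope; this is the kind of discrepancy the chapter examines.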

To be sure, we are still not able to estimate $\beta_{YZ\cdot X}$
directly since $\sigma_{XY}$ cannot be determined. Thus we cannot
estimate the grouping bias $\theta$. But the estimate of $\beta_{YZ}$
can be used in place of the unobtainable estimate of
$\beta_{YZ\cdot X}$ in the equation for bias. This substitution yields
a function of the estimated grouping bias,

[5.2]    $\pi = \left(\dfrac{\beta_{YZ}}{\beta_{YZ\cdot X}}\right)\theta = \beta_{YZ}\,\beta_{XZ}\,\dfrac{1 - \sigma^2_{\bar{X}}}{\sigma^2_{\bar{X}}}$

In most cases, enough is known about the covariance of X and Y
to determine at least its sign. When $\beta_{YX}$ is positive
(negative) and $\beta_{XZ}$ and $\beta_{YZ}$ have the same sign,
$\beta_{YZ}$ provides an upper (lower) bound for $\beta_{YZ\cdot X}$.
When $\beta_{XZ}$ and $\beta_{YZ}$ have opposite signs, $\beta_{YZ}$
becomes a lower (upper) bound. Thus, we expect small $\pi$ values to
occur with good estimators of $\beta_{YX}$ and large $\pi$ values with
poor estimators.

Table 6.5 lists, for each grouping variable, the predicted biases
(both $\theta$ and $\pi$) in estimating the coefficient from the
regression of SRAA on ACH. Later, we shall compare these values with
the observed biases resulting from grouping.
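
The two predicted-bias quantities can be computed directly from the entries of Table 6.4. The sketch below assumes standardized variables and the form of the bias equation $\theta = \beta_{YZ\cdot X}\,\beta_{XZ}\,(1 - \sigma^2_{\bar{X}})/\sigma^2_{\bar{X}}$, with $\pi$ obtained by substituting $\beta_{YZ}$ for $\beta_{YZ\cdot X}$; the function names are hypothetical.

```python
def predicted_bias(beta_yz_x, beta_xz, sd_xbar):
    """theta: predicted grouping bias when beta_YZ.X is available."""
    var_xbar = sd_xbar ** 2
    return beta_yz_x * beta_xz * (1.0 - var_xbar) / var_xbar


def bias_bound(beta_yz, beta_xz, sd_xbar):
    """pi: the same expression with beta_YZ in place of beta_YZ.X."""
    return predicted_bias(beta_yz, beta_xz, sd_xbar)


# SRAA2 row of Table 6.4: beta_YZ.X = .819, beta_XZ = .476, sigma_Xbar = .481
theta_sraa2 = predicted_bias(0.819, 0.476, 0.481)

# HSMATH row of Table 6.4: beta_YZ = .202, beta_XZ = .479, sigma_Xbar = .489
pi_hsmath = bias_bound(0.202, 0.479, 0.489)

print(round(theta_sraa2, 3), round(pi_hsmath, 3))
```

These values agree, up to rounding of the three-decimal inputs, with the entries 1.295 and .307 listed for SRAA2 and HSMATH in Table 6.5.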

D. Estimates of Regressions from Different Grouping Methods

Two standards are applied for judging the precision of estimating
$\beta_{YX}$ from data grouped on a given Z. First, estimates of bias
and efficiency from grouping on different variables are compared, both
within and between categories. These comparisons focus on the effects
of within-variable factors on precision and on the relative precision
of different categories of variables.

We also examine precision on an absolute scale, i.e.,
independently of the scaling of ACH and SRAA. To do this, we (a) compare
observed and predicted bias from grouping with twice the standard error
of its estimate, $SE(B_{\bar{Y}\bar{X}})$; and (b) examine indices of
efficiency generated from the ratio of the mean-squared error from
ungrouped data to the mean-squared error from a particular grouping.
Since these standard indices of efficiency tend to be small due to the
coarseness of grouping, we also compare the efficiency of a particular
grouping with the efficiency of forming an equal number of groups
randomly, $(m - 1)/(N - 1)$.

1. Relative Precision by Category

Table 6.5 contains estimates of the regression coefficients,
their standard errors, the observed and predicted grouping bias, and
estimates of the square root of the mean-squared error of each
grouping variable. The grouping variables are ordered within
categories by the size of the observed bias except for ACH2 and SRAA2,
which are listed first in their respective categories.

Table 6.5. Estimates from grouped data of coefficients describing the
regression of SRAA on ACH.

              Number of                         Bias        Bias Predicted from
Grouping      Groups      $B_{\bar{Y}\bar{X}}$  Observed    $\theta$    $\pi$      $SE(B_{\bar{Y}\bar{X}})$   $\sqrt{MSE}$
Variable*     (m)

Category IV
ID2           100          .558                  .029         .004        .040      .0739                      .0794
ID1            10          .442                 -.087         .075        .225      .1831                      .2027

Category III
ACH2           10          .531                  .002         .002        .142      .0615                      .0615
PARINC         10          .558                  .029         .130        .295      .1314                      .1345
HSPHYS          5          .571                  .042         .095        .433      .0915                      .1294
CLIMP           4          .717                  .188         .016        .401      .3971                      .4382

Category I
SRAA2           5         1.853                 1.324        1.295       1.507      .0631                     1.3255
HSMATH          5          .414                 -.115        -.100        .307      .0248                      .1176
SAT2           13          .671                  .142         .150        .210      .0670                      .1570
HSGPA2         23          .702                  .173         .150        .451      .0287                      .1753
POPED           6          .911                  .382         .440        .874      .1626                      .4152
REPGPA          7          .917                  .388         .360        .635      .0617                      .3929
NOBOOK          5         1.334                  .805         .800       1.285      .1133                      .8129
COLEFF          4         1.461                  .932         .765       1.194      .1160                      .9392
ANTDEG          5         1.631                 1.102        1.117       1.586      .2680                     1.1341
QCJOB           4         1.853                 1.324        1.188       1.630      .3533                     1.3703
PARASP          5         1.946                 1.417        1.519       2.048      .7339                     1.5958

Estimates from ungrouped data: $b_{YX} = .529$; $SE(b_{YX}) = .0032$.

$\sqrt{MSE(B_{\bar{Y}\bar{X}})} = \sqrt{(\text{observed bias})^2 + [SE(B_{\bar{Y}\bar{X}})]^2}$

*With the exception of ACH2 and SRAA2, variables within categories are
ordered on the basis of observed bias.

In general, the estimates conform to our expectations though the
bias and mean-squared error (MSE)^ are enormous for some Category I
variables. Category IV grouping yielded estimates with small bias.
In fact, only grouping on ACH2 (grouping on the independent variable)
gives better precision (small bias and small mean-squared error) than
the estimate from ID2. But ID2 required ten times as many groups to
achieve this level of accuracy.
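
The observed bias and root mean-squared error reported in Table 6.5 follow from the grouped coefficient, its standard error, and the ungrouped coefficient. A minimal sketch, with hypothetical function names and with the input values copied from Table 6.5:

```python
import math

B_UNGROUPED = 0.529  # b_YX from the ungrouped regression of SRAA on ACH


def observed_bias(b_grouped, b_ungrouped=B_UNGROUPED):
    """Grouped coefficient minus the ungrouped coefficient."""
    return b_grouped - b_ungrouped


def root_mse(bias, se):
    """Square root of (bias^2 + SE^2)."""
    return math.sqrt(bias ** 2 + se ** 2)


# (grouped coefficient, standard error) as printed in Table 6.5
for name, (b, se) in {"ID2": (0.558, 0.0739),
                      "ACH2": (0.531, 0.0615),
                      "HSMATH": (0.414, 0.0248)}.items():
    bias = observed_bias(b)
    print(name, round(bias, 3), round(root_mse(bias, se), 4))
```

For these three rows the computed values reproduce the tabled entries (.0794, .0615, and .1176).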

The bias from grouping by ID1, the other Category IV variable,
is three times as large as the bias from grouping on ID2. Its
estimated MSE is more than six times larger than the MSE from ID2.
Category III grouping yields smaller bias than grouping by ID1 in
three out of four cases, the exception being CLIMP, which forms less
than half as many groups. Certain Category I variables yielded
estimates with smaller MSE's. Clearly, random grouping should be
avoided unless many groups can be formed and no Category II variable is
readily available.

Three of four Category III variables yielded highly satisfactory
estimates with small MSE's. The estimate from grouping on ACH2 is
the most efficient of all estimates generated.

Only CLIMP among the Category III variables yielded an estimate
with considerable bias and large MSE. The regression coefficients in
Table 6.4 suggest that CLIMP acts as a suppressor when it enters the
model with ACH and SRAA. As mentioned earlier, the small number
of groups formed by CLIMP also has detrimental effects on the
precision of its estimate.

^In Table 6.5, $\sqrt{MSE}$ was used instead of MSE for possible
comparison with $SE(B_{\bar{Y}\bar{X}})$. In the discussion that
follows, $\sqrt{MSE}$ and MSE are interchangeable.

Three Category I variables, HSMATH, SAT2, and HSGPA2, yielded
precise estimates of $\beta_{YX}$ relative to the other Category I
groupings. All have substantially larger zero-order correlations with
ACH than with SRAA, and their between-groups standard deviations of ACH
are large.

The remaining Category I variables, including SRAA2, yield
estimates with large bias and large MSE. At the extreme (PARASP),
$B_{\bar{Y}\bar{X}}$ is almost four times the ungrouped $b_{YX}$, and
has a MSE 200 times the MSE of $b_{YX}$.

As Blalock and Hannan have stated, grouping on the dependent
variable is disastrously biased. The unmeasured factors represented by
the disturbance term in the initial linear model (Equation [3.1]) are
confounded with the effects of the primary regressor to such a degree
that the relation of ACH to SRAA is unrecognizable.

Fortunately, there are warning signals of poor estimation from
Category I grouping, even when anonymously collected data prevent
estimation of $\beta_{YZ\cdot X}$. Of the eight Category I variables
that yielded the largest biases, all except REPGPA had higher
zero-order correlations with SRAA than with ACH (i.e.,
$r_{YZ} > r_{XZ}$). With SRAA2, ANTDEG, QCJOB, or PARASP as the
grouping variable, $\beta_{YZ\cdot X}$ becomes larger than
$\beta_{XZ}$. Also, the worst Category I variables (ANTDEG, QCJOB,
PARASP) create a small $\sigma_{\bar{X}}$, and distribute their
observations unevenly among a few initial groups.^

^We must re-emphasize that the superiority of a particular grouping
variable is a function of the relation to be estimated. When we
instead regress ACH on SRAA, for which $b_{XY} = .812$, grouping by
ANTDEG ($B_{\bar{X}\bar{Y}} = .851$) and QCJOB
($B_{\bar{X}\bar{Y}} = .751$) result in small bias while grouping by
HSPHYS ($B_{\bar{X}\bar{Y}} = 2.452$) and PARINC
($B_{\bar{X}\bar{Y}} = 1.848$) result in large bias. The standard
errors for ANTDEG and QCJOB are also small for this regression. The
question to be answered determines the quality of a particular
characteristic for grouping purposes.

2. Precision Independent of Scaling

There are no universal standards for judging what are acceptable
values for bias and mean-squared error. The purposes for which an
estimate is to be used determine what is "suitably precise". However,
we can begin to set standards for acceptable bias and efficiency from
grouping which are invariant under scalar transformations of variables.

We suggest that the investigator compare the predicted bias
($\theta$) from a given grouping with twice the standard error of its
corresponding estimate ($B_{\bar{Y}\bar{X}}$). If $\theta$ is larger
than $2\,SE(B_{\bar{Y}\bar{X}})$, drop the grouping variable from
consideration. Selection among the remaining grouping variables can be
based on the size of $\theta$, on the efficiency of estimation, or on
some other criteria (e.g., ease of collection or number of groups).

To judge the efficiency of a given estimate, the investigator can
calculate
$\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}}) = MSE(b_{YX}) / MSE(B_{\bar{Y}\bar{X}})$.
$\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}})$ should be as large as
possible. Certainly, variables with efficiencies smaller than the worst
of the Category IV variables should be excluded. As a further
comparison, we suggest that the investigator calculate the ratio
$\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}}) / \mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}\,(m\ \text{random groups of equal size})})$.
This ratio provides some indication of the gain over random grouping in
each case.
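
The comparison with random grouping can be sketched in a few lines. The function names are hypothetical; the baseline $(m - 1)/(N - 1)$ for m random groups is the one stated in Section D, N = 2,676 is the sample size given earlier, and the efficiency of .040 for ID2 is the value printed in Table 6.6.

```python
N = 2676  # observations in the ungrouped analysis


def efficiency(mse_ungrouped, mse_grouped):
    """Eff(b, B) = MSE(b) / MSE(B)."""
    return mse_ungrouped / mse_grouped


def random_grouping_efficiency(m, n=N):
    """Efficiency expected from forming m groups at random: (m - 1)/(N - 1)."""
    return (m - 1) / (n - 1)


def relative_efficiency(eff, m, n=N):
    """Gain of a systematic grouping over m random groups."""
    return eff / random_grouping_efficiency(m, n)


# ID2 forms m = 100 groups and has a tabled efficiency of .040.
print(round(relative_efficiency(0.040, 100), 2))
```

For ID2 the ratio is close to 1, as one would expect for a grouping variable that is itself essentially random.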



^The lowest two groups of ANTDEG's five groups contain fewer than
100 observations. Eighty-six (86) per cent of the observations on
QCJOB are in two of its four categories. Ninety-seven (97) per
cent of PARASP's were either "complete college" (4) or "obtain a
graduate or professional degree" (5).

If we follow these guidelines in our example, we obtain the
results depicted in Table 6.6. We have also compared the observed
bias to $2\,SE(B_{\bar{Y}\bar{X}})$ in the table. With
$2\,SE(B_{\bar{Y}\bar{X}})$ as a criterion, all Category I variables
are excluded, and all Category III and Category IV variables are
retained, regardless of whether we look at observed bias or $\theta$.
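
The bias screen can be applied mechanically. The sketch below uses the predicted bias $\theta$ and $SE(B_{\bar{Y}\bar{X}})$ for a subset of the grouping variables, copied from Table 6.5; the dictionary and function names are hypothetical.

```python
# (predicted bias theta, SE of the grouped coefficient) from Table 6.5
TABLE_6_5 = {
    "ID2": (0.004, 0.0739),
    "ID1": (0.075, 0.1831),
    "ACH2": (0.002, 0.0615),
    "PARINC": (0.130, 0.1314),
    "HSPHYS": (0.095, 0.0915),
    "CLIMP": (0.016, 0.3971),
    "HSMATH": (-0.100, 0.0248),
    "SAT2": (0.150, 0.0670),
    "PARASP": (1.519, 0.7339),
}


def within_bounds(bias, se, k=2.0):
    """True when |bias| does not exceed k standard errors (k = 2 in the text)."""
    return abs(bias) <= k * se


retained = [name for name, (theta, se) in TABLE_6_5.items()
            if within_bounds(theta, se)]
print(retained)
```

For this subset the screen keeps the Category III and Category IV variables and drops the three Category I variables, consistent with the result described above.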

In every case, efficiency is small, but this occurs because of
the small m values for almost every variable. If we compare the
efficiency of each systematic grouping with that of grouping by
ID1, we can exclude the worst Category III variable, which was
previously retained. Furthermore, there are marked improvements
in efficiency relative to random grouping for all Category III
groupings and for the best of Category I grouping variables.

The variables that remain after applying exclusion principles
for both bias and efficiency yielded estimates with the smallest
biases and smallest MSE's overall. In Section 6.II.F, we
suggest how the investigator might combine his best estimates when
he does not wish to select among them.

E. Predicted Bias vs. Observed Bias

Despite the specification and measurement errors, our predictions
(Table 6.5) as to bias stood up well. For every grouping where the
observed bias was greater than .2, $\theta$ was greater than .2. With
the exception of PARINC, the predicted bias was less than .1 whenever
the observed bias was less than .1.

The prediction from $\theta$ worked poorly only for ID1 and CLIMP.
In the case of ID1, it is the sign reversal that troubles us and not
the size of the error. There seems to be no reasonable explanation for
the sign reversal other than the use of few groups with a random
grouping variable. In the previous section, we provided an explanation
as to why estimates from CLIMP grouping might be disappointing (its
suppressor relation with ACH and SRAA and its smaller number of
groups); this explanation may also hold for poor prediction from CLIMP.

Table 6.6. Comparison of estimates from grouped data using different
criteria for acceptable bias from the regression of SRAA on ACH.

                     Observed                     Predicted
Grouping             Bias                         Bias ($\theta$)              $\mathrm{Eff}(b_{YX}, B_{\bar{Y}\bar{X}})$   Eff relative to
Variable^a (m)       $< 2\,SE(B_{\bar{Y}\bar{X}})$    $< 2\,SE(B_{\bar{Y}\bar{X}})$                                     random grouping

Category IV
ID2     (100)        +                            +                            .040                          1.08
ID1     ( 10)        +                            +                            .016                          4.71

Category III
ACH2    ( 10)        +                            +                            .052                         15.29
PARINC  ( 10)        +                            +                            .024                          7.06
HSPHYS  (  5)        +                            +                            .025                         16.67
CLIMP   (  4)        +                            +                            .007                          6.36

Category I
SRAA2   (  5)        -                            -                            .002                          1.33
HSMATH  (  5)        -                            -                            .027                         18.00
SAT2    ( 13)        -                            -                            .020                          4.44
HSGPA2  ( 23)        -                            -                            .018                          2.20
POPED   (  6)        -                            -                            .008                          4.21
REPGPA  (  7)        -                            -                            .008                          3.64
NOBOOK  (  5)        -                            -                            .004                          2.67
COLEFF  (  4)        -                            -                            .003                          2.73
ANTDEG  (  5)        -                            -                            .003                          2.00
QCJOB   (  4)        -                            -                            .002                          1.82
PARASP  (  5)        -                            -                            .002                          1.33

+  Within bounds of acceptable bias
-  Outside bounds of acceptable bias

^a With the exception of ACH2 and SRAA2, variables within categories
are ordered on the basis of observed bias (see Table 6.5).

Every value of $\pi$ proved to be larger than the observed bias. The
Category III and Category IV variables along with the three Category I
variables with the smallest bias yielded the lowest values of $\pi$.

F. Composites of Estimates from Multiple Grouping Variables

The above findings suggest that an investigator can distinguish
those grouping characteristics which lead to reasonably accurate
estimates from those providing extremely misleading ones in empirical
studies similar to ours. Once this separation has been accomplished,
the investigator can choose a characteristic with small predicted bias.
Better yet, he can use the available information about each
characteristic and its expected bias to form a weighted composite of
good grouped estimates. For example, he can weight grouped estimates in
inverse proportion to their predicted bias. Alternatively, he can give
additional weight to the more stable estimates.

Table 6.7 provides two examples of composite estimates. In
Example (A), we assume knowledge of $\beta_{YZ\cdot X}$ so that
$\theta$ can be used. In Example (B), $\beta_{YZ\cdot X}$ is treated
as unknown, and thus the $\pi$ values are used to weight the estimates.
In each example, five of the seven grouping variables with the smallest
predicted bias are used. We exclude ID2 as redundant with ID1, and
because it forms many more groups than any other variable. ACH2 is
excluded on the grounds that compositing would be unnecessary if
grouping on ACH2 were possible. Three sets of weights are determined:
(1) in inverse proportion to the predicted bias, (2) in inverse
proportion to $SE(B_{\bar{Y}\bar{X}})$, and (3) in inverse proportion
to the predicted bias and $SE(B_{\bar{Y}\bar{X}})$. Since observations
were initially standardized, the $B_{\bar{Y}\bar{X}}$ were transformed
to Fisher Z's before weighting and averaging.

Table 6.7. Weighted composites from grouped estimates of
$\beta_{YX}$ from the regression of SRAA on ACH.

Grouping                                               Predicted   Weight    Weight    Weight
Variable^a    $B_{\bar{Y}\bar{X}}$^b   $SE(B_{\bar{Y}\bar{X}})$   Bias        (1)^c     (2)^d     (3)^e

(A) Weights based on $\theta$
CLIMP         .717                     .3971                      .016        .243      .130      .162
ID1           .442                     .1831                      .075        .207      .195      .202
HSPHYS        .571                     .0915                      .095        .195      .222      .217
HSMATH        .414                     .0248                      .100        .192      .242      .232
PARINC        .558                     .1314                      .130        .174      .210      .188
Estimates yielded by the weights                                              .562*     .531*     .538*

(B) Weights based on $\pi$
SAT2          .671                     .0670                      .210        .213      .229      .236
ID1           .442                     .1831                      .225        .211      .224      .226
PARINC        .558                     .1314                      .295        .199      .209      .202
HSMATH        .414                     .0248                      .307        .197      .242      .226
CLIMP         .717                     .3971                      .401        .180      .126      .111
Estimates yielded by the weights                                              .568*     .566*     .554*

^a In both examples, the grouping variables are ordered by the
predicted bias ($\theta$ or $\pi$).

^b The $B_{\bar{Y}\bar{X}}$ were transformed to Fisher Z's before
weighting and averaging.

^c Weight (1) = {[$\Sigma$ Predicted bias ($Z_i$)] - [Predicted bias ($Z_i$)]} /
4$\Sigma$[Predicted bias ($Z_i$)].

^d Weight (2) = {[$\Sigma\,SE(B_{\bar{Y}\bar{X}})_i$] - $SE(B_{\bar{Y}\bar{X}})_i$} / 4$\Sigma$[$SE(B_{\bar{Y}\bar{X}})_i$].

^e Weight (3) = {[Weight (1) for $Z_i$][Weight (2) for $Z_i$]} / [$\Sigma$ (numerator)].

* cf. $b_{YX} = .529$

The resulting weighted composites are highly satisfactory. All
composites were within .04 of $b_{YX}$. Estimate A(2) equals the
estimate from grouping on the independent variable ACH2. The
remaining composite estimates do nearly as well, equaled or exceeded
only by grouping on ACH2, and in some cases, by grouping on ID2 and
PARINC. Clearly, judicious weighting of grouped estimates can lead to
precise estimation of the ungrouped regression coefficient.
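
The weighting scheme of Table 6.7 can be sketched as follows, using the Example (A) entries. The function names are hypothetical. Two points are assumptions of this sketch rather than statements in the text: the divisor 4 in the table footnotes is written as k - 1 for k estimates, and the weighted Fisher Z's are divided by the sum of the weights before transforming back.

```python
import math


def inverse_weights(values):
    """Weights of the form (sum - v_i) / ((k - 1) * sum); they sum to 1."""
    total = sum(values)
    k = len(values)
    return [(total - v) / ((k - 1) * total) for v in values]


def combined_weights(w1, w2):
    """Product of two weight sets, rescaled to sum to 1."""
    products = [a * b for a, b in zip(w1, w2)]
    total = sum(products)
    return [p / total for p in products]


def composite(estimates, weights):
    """Weighted average on the Fisher Z scale, transformed back."""
    z = sum(w * math.atanh(b) for b, w in zip(estimates, weights))
    return math.tanh(z / sum(weights))


# Example (A) of Table 6.7: CLIMP, ID1, HSPHYS, HSMATH, PARINC
b_grouped = [0.717, 0.442, 0.571, 0.414, 0.558]
theta = [0.016, 0.075, 0.095, 0.100, 0.130]
se = [0.3971, 0.1831, 0.0915, 0.0248, 0.1314]

w1 = inverse_weights(theta)    # weight (1): by predicted bias
w2 = inverse_weights(se)       # weight (2): by standard error
w3 = combined_weights(w1, w2)  # weight (3): by both

for label, w in (("A(1)", w1), ("A(2)", w2), ("A(3)", w3)):
    print(label, round(composite(b_grouped, w), 3))
```

Weights computed this way agree with the tabled weights to within rounding of the inputs, and the composite from weight (2) falls within a few thousandths of the tabled .531.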

III. Regression of Achievement on Aptitude

In our next example, we estimate the regression coefficient of
achievement test performance (ACH) on aptitude test performance (SAT).
Anonymity is not usually a problem in this case, but grouping could be
economical. Thus we assume that $\beta_{YZ\cdot X}$ is known and limit
discussion to the full-information situation.

This example will be considered in much less detail. Our primary
purpose in this second empirical example is to illustrate that the
suitability of a grouping variable depends on its relations with the
main variables. We again standardized all variables prior to conducting
the analysis.

A. Regression with Ungrouped Data

The equation relating ACH (Y) to SAT (X) is

ACH = .839 (SAT)

with

$SE(b_{YX}) = .0105$

and 




B. Categorization of Grouping Variables

Table 6.8 contains estimates for each grouping of the regression
coefficients ($\beta_{YX\cdot Z}$, $\beta_{YZ\cdot X}$, $\beta_{XZ}$,
$\beta_{YZ}$) and their standard errors. The between-group standard
deviation, $\sigma_{\bar{X}}$, of SAT for the grouping variable is
also included.

Again, we required an estimate of either $\beta_{YZ\cdot X}$ or
$\beta_{XZ}$ to exceed three times its standard error to be considered
significantly different from zero. The resulting categorization was as
follows:

$|\beta_{XZ}| \ge 3\,SE(\beta_{XZ})$:
    Category I:    HSGPA2, ACH2, REPGPA, SRAA2,
                   ANTDEG, HSMATH, HSPHYS, COLEFF
    Category III:  SAT2, PARINC, FATHED, NOBOOK,
                   PARASP, CLIMP, QCJOB

$|\beta_{XZ}| < 3\,SE(\beta_{XZ})$:
    Category II:   (NONE)
    Category IV:   ID2, ID1

Categories of several variables in the ACH-on-SAT regression
differ from their categories with respect to the SRAA-on-ACH
regression. ACH2, which now represents grouping on the dependent
variable rather than on the independent variable, moves from Category
III to Category I. HSPHYS also moves due to its correlation with ACH.
The relative sizes of $r_{YZ}$ and $r_{XZ}$ again serve as a clue to
poor grouping variables since $r_{YZ}$ is larger than $r_{XZ}$ in six
of the eight Category I groupings.

The number of variables in Category III is striking. Of the seven
Category III variables in the ACH-on-SAT regression, five were in
Category I in the regression of SRAA on ACH. The correlations of
the Category III variables with ACH and SAT do not differ greatly
in magnitude though the correlation of each Z with SAT is always
larger than its correlation with ACH. The shift of SAT2 from
Category I to Category III was expected; it now represents grouping on
the independent variable. The remaining variables apparently enter
Category III in part because of the strong correlation between ACH
and SAT, which the model apportions to the independent variable SAT.

Table 6.8. Estimates of parameters relating SAT (X) and ACH (Y) to
alternative grouping variables (Z).^a

Entries marked [illegible] could not be read in the source copy.

                                      Parameter Estimates^b
Variable  Group
Name      Size (m)  $\beta_{YX\cdot Z}$    $\beta_{YZ\cdot X}$    $\beta_{XZ}$            $\beta_{YZ}$            $\sigma_{\bar{X}}$

ID2       100        .839 (.0105)           .014 (.0105)            .008 (.0193)            .020 (.0193)          .186
ID1        10        .839 (.0105)          -.003 (.0105)           -.046 (.0193)           -.049 (.0193)          .069
HSGPA2     23        .759 (.0116)           .164 (.0116)            .488 (.0169)           [illegible] (.0163)    .517
SAT2       13        .884 (.0662)          -.042 (.0662)           [illegible] (.0031)     [illegible] (.0109)    .989
ACH2       10        .082 (.0061)           .916 (.0061)            .827 (.0109)           [illegible] (.0035)    .835
PARINC     10        .838 (.0106)           .006 (.0106)           [illegible] (.0193)     [illegible] (.0193)    .146
REPGPA      7        .781 (.0117)          -.124 (.0117)           -.468 (.0171)           [illegible] (.0169)    .498
POPED       6        .838 (.0105)           .007 (.0106)           [illegible] (.0191)     [illegible] (.0192)    .169
ANTDEG      5        .834 (.0106)           .039 (.0106)           [illegible] (.0192)     [illegible] (.0191)    .141
HSMATH      5        .765 (.0104)           .214 (.0104)            .346 (.0181)           [illegible] (.0170)    .349
HSPHYS      5        .811 (.0107)           .109 (.0107)            .257 (.0187)           [illegible] (.0183)    .294
NOBOOK      5        .844 (.0107)          -.025 (.0107)           [illegible] (.0189)     [illegible] (.0191)    .204
PARASP      5        .839 (.0106)          -.007 (.0106)            .087 (.0193)            .066 (.0193)          .101
SRAA2       5        .811 (.0123)           .054 (.0123)            .520 (.0165)            .476 (.0170)          .531
CLIMP       4        .838 (.0107)           .009 (.0107)           [illegible] (.0191)      .147 (.0191)          .185
COLEFF      4        .835 (.0106)           .039 (.0106)            .114 (.0192)            .134 (.0192)          .134
QCJOB       4        .838 (.0106)           .007 (.0106)            .118 (.0192)            .106 (.0192)          .123

^a All variables have been standardized prior to grouping so that
$\sigma_Y = \sigma_X = \sigma_Z = 1$, $\beta_{XZ} = \rho_{XZ}$, and
$\beta_{YZ} = \rho_{YZ}$.

^b Numbers in parentheses are the standard errors of the regression
coefficients.
C. Estimates of Regressions from Different Grouping Methods

Table 6.9 contains estimated regression coefficients and other
information. With a few minor exceptions, the results conform to our
expectations.

The precision of Category IV grouping again is strongly related to
the number of groups. The accuracy (bias) and stability (MSE) of
grouping on ID2 is exceeded only by grouping on the independent
variable (SAT2). Grouping on ID1 yields a poorer estimate than
grouping on ID2, on any Category III variable, and on half of the
Category I variables.

Category III grouping is clearly superior overall to grouping on
variables from other categories. Observed bias is smaller than
$2\,SE(B_{\bar{Y}\bar{X}})$ for 5 of 7 Category III variables. (See
Table 6.10.) The exceptions are QCJOB and NOBOOK, which form few groups
with an uneven distribution of observations among the groups.

SRAA2 and COLEFF are the only Category I variables for which the
observed bias falls within $2\,SE(B_{\bar{Y}\bar{X}})$. The estimates
from the Category I variables, other than SRAA2, in addition to
yielding large bias, are about as inefficient as grouping on ID1.

The decision rules discussed in Section 6.II.D are also useful
with this example. If a variable is eliminated when
(a) $|\theta| \geq 2\,\mathrm{SE}(b_{\bar{Y}\bar{X}})$ or
(b) $\mathrm{Eff}(b_{\bar{Y}\bar{X}};\, b) \leq \mathrm{Eff}(b_{\mathrm{random}(m)};\, b)$,
only NOBOOK among the Category III
Table 6.9. Estimates from grouped data of coefficients describing the
regression of ACH on SAT.

Grouping    Number of                 Observed   Predicted bias
variable    groups (m)   Estimate     bias       (from θ)          SE      Root MSE

CATEGORY IV
ID2            100         .832       -.007          .003         .0590     .0594
ID1             10        1.053        .214          .029         .2168     .3036

CATEGORY III
PARINC          10         .817       -.022          .021         .0598     .0636
GLIMP            4         .876        .036          .042         .0388     .0528
POPED            6         .877        .039          .038         .0685     .0788
QCJOB            4         .912        .073          .054         .0216     .0775
PARASP           5         .744       -.095         -.059         .0903     .1310
NOBOOK           5         .718       -.121         -.174         .0372     .1266

CATEGORY I
ACH2            10        1.168        .329          .329         .0541     .3338
SRAA2            5         .899        .060          .072         .0543     .0809
REPGPA           7        1.019        .180          .176         .0418     .1848
COLEFF           4        1.054        .213          .241         .1169     .2438
HSGPA2          23        1.057        .218          .219         .0329     .2205
ANTDEG           5        1.120        .281          .271         .0607     .2875
HSPHYS           5        1.237        .398          .296         .0422     .4002
HSMATH           5        1.396        .557          .531         .0478     .5590

Estimates from ungrouped data: $b_{YX} = .8391$, $\mathrm{SE}(b_{YX}) = .0105$.

With the exception of ACH2 and SAT2, variables within categories are
ordered on the basis of observed bias.
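
The last two numeric columns of Table 6.9 are not labelled legibly in this copy. A quick arithmetic check suggests they are the standard error of each grouped estimate and the square root of its mean squared error, that is, sqrt(bias^2 + SE^2) computed from the observed bias. The sketch below is only such a check: the column interpretation is an assumption, the function names are illustrative, and the figures are copied from the table.

```python
import math

# (variable, observed bias, SE, reported final column), copied from Table 6.9.
# Reading the final column as root MSE is an assumption being tested here.
ROWS = [
    ("ID2", -0.007, 0.0590, 0.0594),
    ("ID1", 0.214, 0.2168, 0.3036),
    ("PARINC", -0.022, 0.0598, 0.0636),
    ("GLIMP", 0.036, 0.0388, 0.0528),
    ("POPED", 0.039, 0.0685, 0.0788),
    ("QCJOB", 0.073, 0.0216, 0.0775),
    ("PARASP", -0.095, 0.0903, 0.1310),
    ("NOBOOK", -0.121, 0.0372, 0.1266),
    ("ACH2", 0.329, 0.0541, 0.3338),
    ("SRAA2", 0.060, 0.0543, 0.0809),
    ("REPGPA", 0.180, 0.0418, 0.1848),
    ("COLEFF", 0.213, 0.1169, 0.2438),
    ("HSGPA2", 0.218, 0.0329, 0.2205),
    ("ANTDEG", 0.281, 0.0607, 0.2875),
    ("HSPHYS", 0.398, 0.0422, 0.4002),
    ("HSMATH", 0.557, 0.0478, 0.5590),
]


def root_mse(bias, se):
    """Square root of mean squared error: sqrt(bias^2 + SE^2)."""
    return math.sqrt(bias ** 2 + se ** 2)


def check_rows(rows):
    """Return (name, computed root MSE, reported value) for each row."""
    return [(name, root_mse(bias, se), reported)
            for name, bias, se, reported in rows]


if __name__ == "__main__":
    for name, computed, reported in check_rows(ROWS):
        print(f"{name:7s} computed={computed:.4f} reported={reported:.4f}")
```

Under this reading the computed and reported values differ only in the third or fourth decimal place, which is consistent with rounding of the tabled bias and standard error.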




Table 6.10. Comparison of estimates from grouped data using different
criteria for acceptable bias in the regression of ACH on SAT.

                        Bias < 2 SE of grouped estimate
Grouping                -------------------------------    Eff of grouped   Ratio to Eff of
variable       (m)      Observed        Predicted (θ)      estimate         random grouping (m)

Category IV
ID2           (100)                                           .177               4.78
ID1           ( 10)                                           .034              10.00

Category III
SAT2          ( 13)        +                 +                .553             122.89
PARINC        ( 10)        +                 +                .165              48.53
GLIMP         (  4)        +                 +                .198             180.00
POPED         (  6)        +                 +                .133              70.00
QCJOB         (  4)        -                 +                .135             122.73
PARASP        (  5)        +                 +                .080              53.33
NOBOOK        (  5)        -                 -                .083              55.33

Category I
ACH2          ( 10)        -                 -                .031               9.12
SRAA2         (  5)        +                 +                .130              86.67
REPGPA        (  7)        -                 -                .057              25.91
COLEFF        (  4)        +                 -                .043              28.67
HSGPA2        ( 23)        -                 -                .048               5.85
ANTDEG        (  5)        -                 -                .037              24.67
HSPHYS        (  5)        -                 -                .026              17.33
HSMATH        (  5)        -                 -                .019              12.67

+ Within bounds of acceptable bias.
- Outside bounds of acceptable bias.

With the exception of ACH2 and SAT2, variables within categories are
ordered on the basis of observed bias. (See Table 6.9.)
variables is eliminated, and all the Category I variables except SRAA2
are dropped.

D. Predicted Bias vs. Observed Bias

The results of the predictions in the regression of ACH on SAT
are as satisfactory as the results in the earlier example. The
prediction from ID1 grouping is again among the most errant. In
general, however, grouping characteristics which produce good estimates
can be selected on the basis of predicted bias, especially when the
standard errors of the grouped estimates are also taken into account.

IV. Summary of Empirical Results

We set out in Chapter 6 to demonstrate the utility of the grouping
concepts and methods developed in Chapters 3 and 4 under realistic
empirical conditions. The empirical evidence regarding the estimation
of $\beta_{YX}$ conformed to the predictions from the principle of
incorporating the grouping characteristics as variables in the
structural models, which in turn led to the taxonomic categorization
of grouping variables. The latter classification resulted in clusters
of readily identifiable "good" and "bad" grouping variables under most
aggregated conditions. We further showed that if the investigator forms
a weighted composite of estimates from several of the best grouping
variables, the resulting estimate is invariably highly accurate.
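
The weighting scheme used for the composites is not reproduced in this section. As a rough illustration of the idea only, the sketch below combines three Category III estimates from Table 6.9 (PARINC, GLIMP, POPED) with inverse-variance weights. The choice of inverse-variance weights and the function name are assumptions made for this example, not the procedure of the study.

```python
def inverse_variance_composite(estimates):
    """Combine (estimate, standard error) pairs with weights 1 / SE^2.

    Returns the weighted composite and an approximate standard error.
    The standard error formula treats the estimates as independent,
    which is optimistic here: grouped estimates computed from the same
    sample are correlated.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    composite = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return composite, (1.0 / total) ** 0.5


# Grouped estimates and standard errors from Table 6.9:
# PARINC, GLIMP, POPED (all Category III grouping variables).
CATEGORY_III = [(0.817, 0.0598), (0.876, 0.0388), (0.877, 0.0685)]

if __name__ == "__main__":
    composite, approx_se = inverse_variance_composite(CATEGORY_III)
    print(f"composite = {composite:.4f}, approximate SE = {approx_se:.4f}")
```

Any weighted average with positive weights lies between the smallest and largest component estimates, so a composite of this kind cannot be worse than its worst component. How close it comes to the ungrouped coefficient depends on the weights actually chosen.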

Thus we demonstrated some effective strategies of estimating simple
linear regression coefficients (and zero-order correlation
coefficients) when data aggregation is under the investigator's control
and the grouping characteristics under consideration have at least an
interval scale. To a certain degree, our results are generalizable to
naturally aggregated data where some degree of disaggregation is
feasible. The possibilities of utilizing nominally scaled grouping
characteristics were discussed in Chapter IV, but the procedures
suggested for such variables were not demonstrated empirically.
CHAPTER 7

SUMMARY AND CONCLUSIONS

I. Summary of Findings

We have examined certain consequences of estimating regression
coefficients at the level of individuals from aggregated data. In
Chapter 1, various research contexts in which such questions arise were
described and the main emphasis of our investigation was identified.

In Chapter 2, we reviewed previous literature on grouping in the
two-variable case. The literature on estimating both correlation
coefficients and regression coefficients was considered.

In Chapter 3 we discussed the various factors which affect the
estimation of the simple linear regression coefficient and zero-order
correlation coefficient when data are grouped on some interval variable.
With one exception, it was assumed throughout that there were no
measurement errors in X. Though speaking in terms of "structural
equation models" is somewhat awkward when there are only two variables
involved, this term was used because the bivariate regression was
simply a special case of a multivariate structural model.

We first demonstrated that the estimate of $\beta_{YX}$
($b_{\bar{Y}\bar{X}}$) from grouped data is unbiased if the assumptions
regarding the disturbances in the simple model used by earlier
investigators are satisfied. However, the slope estimates from grouped
data were shown to be less efficient than the estimates from ungrouped
data. This finding led to the criterion of maximization of the
between-group variance (minimization of the within-group variance) of
the independent variable as an appropriate method of judging the
efficiency of alternative grouping procedures.
The investigation was then expanded to consider in greater detail
the concept of grouping by a "grouping variable". This logic suggested
that the criterion by which the individual observations are to be
grouped can be treated as a random variable which may be related to
other variables in the structural equation system. Furthermore, the
system specified that the grouping variable Z, if related to another
variable, is prior to that variable. The alternative relations of the
grouping variable to the dependent and independent variables were then
used to generate a four-category taxonomy, in which each category
includes all grouping variables satisfying the specific set of
relational restrictions imposed by that category.

The estimates from data grouped by Category I variables (Z related
to both Y·X and X) were found to be biased. This apparent disagreement
between the simple model and our alternative structure can be explained
by the misspecification of the simple model when the grouping variable
is directly related to both dependent and independent variables.
Further examination of this phenomenon led to a recommendation that the
relation between the grouping variable and the dependent variable be
minimized. Grouping on Category I or Category II variables was
discouraged because such variables are directly related to Y·X, and
few variables can be expected to meet the necessary criteria that Z
be unrelated to X while the between-group variation in X is nonzero.
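
The contrast between Category III and Category I grouping can be illustrated with a small simulation. The sketch below is not the study's data or procedure. It assumes a hypothetical structural model in which Z is prior to X and Y, with X = 0.6 Z + u and Y = 0.8 X + c Z + e. Setting c = 0 makes Z a Category III variable, and c = 0.3 makes it a Category I variable. All coefficient values, sample sizes, and function names are assumptions chosen for illustration.

```python
import random


def slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def grouped_slope(zs, xs, ys, m):
    """Sort on Z, cut into m equal-sized groups, regress group means.

    Groups are equal in size, so the unweighted regression of group
    means equals the size-weighted one.
    """
    order = sorted(range(len(zs)), key=lambda i: zs[i])
    size = len(order) // m
    mean_x, mean_y = [], []
    for g in range(m):
        members = order[g * size:(g + 1) * size]
        mean_x.append(sum(xs[i] for i in members) / size)
        mean_y.append(sum(ys[i] for i in members) / size)
    return slope(mean_x, mean_y)


def simulate(direct_effect, n=6000, m=10, seed=1):
    """Return (ungrouped slope, grouped slope) for one simulated sample.

    direct_effect is the hypothetical coefficient c of Z on Y given X.
    """
    rng = random.Random(seed)
    zs = [rng.gauss(0, 1) for _ in range(n)]
    xs = [0.6 * z + rng.gauss(0, 0.8) for z in zs]
    ys = [0.8 * x + direct_effect * z + rng.gauss(0, 0.5)
          for x, z in zip(xs, zs)]
    return slope(xs, ys), grouped_slope(zs, xs, ys, m)


if __name__ == "__main__":
    for label, c in (("Category III", 0.0), ("Category I", 0.3)):
        ungrouped, grouped = simulate(c)
        print(f"{label}: ungrouped={ungrouped:.3f} grouped={grouped:.3f}")
```

Under these assumptions the grouped and ungrouped slopes agree closely when c = 0. When c = 0.3, the grouped slope overshoots the ungrouped slope by a wide margin, which is the pattern described above for Category I grouping.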

The relative efficiencies of variables from the different categories
were also examined. It was determined that Category III grouping
variables (Z related to X but not to Y·X) yield the most efficient
grouping procedures so long as there are variables with efficiencies
greater than Category IV variables [whose efficiencies are on the order
of (m-1)/(N-1), where m is the number of groups and N the total
number of observations]. It was suggested that for certain values of
$\beta_{YZ\cdot X}$ and $\beta_{XZ}$, Category I grouping, though
slightly biased, can yield more efficient estimates than either
Category II or Category IV grouping.
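
The order of magnitude (m-1)/(N-1) for random (Category IV) grouping can be checked numerically. The sketch below assumes the simple model with homoscedastic disturbances, under which the efficiency of the grouped slope relative to the ungrouped slope equals the between-group share of the sum of squares of X (the correlation ratio). It then averages that share over many random assignments to m groups, and compares it with grouping on X itself. Sample sizes, seeds, and function names are illustrative assumptions.

```python
import random


def between_share(xs, labels, m):
    """Between-group sum of squares of X divided by the total sum of squares."""
    n = len(xs)
    grand = sum(xs) / n
    total = sum((x - grand) ** 2 for x in xs)
    sums = [0.0] * m
    counts = [0] * m
    for x, g in zip(xs, labels):
        sums[g] += x
        counts[g] += 1
    between = sum(c * (s / c - grand) ** 2
                  for s, c in zip(sums, counts) if c)
    return between / total


def mean_random_share(n=200, m=10, reps=2000, seed=7):
    """Average between-group share when group labels are assigned at random."""
    rng = random.Random(seed)
    labels = [i % m for i in range(n)]
    accumulated = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        rng.shuffle(labels)
        accumulated += between_share(xs, labels, m)
    return accumulated / reps


def sorted_share(n=200, m=10, seed=7):
    """Between-group share when observations are grouped on X itself."""
    rng = random.Random(seed)
    xs = sorted(rng.gauss(0, 1) for _ in range(n))
    labels = [i * m // n for i in range(n)]
    return between_share(xs, labels, m)


if __name__ == "__main__":
    print("random grouping :", round(mean_random_share(), 4))
    print("(m-1)/(N-1)     :", round(9 / 199, 4))
    print("grouping on X   :", round(sorted_share(), 4))
```

With these settings the random-grouping average settles near (m-1)/(N-1), while grouping on X retains most of the variation in X. This is the sense in which a grouping variable related to X is far more efficient than a random one.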

In Chapter 4 we examined the possible causes of variations in the
magnitude of bias and the relative efficiency within categories of the
taxonomy, and the special problems in grouping by nominal
characteristics. The within-variable properties considered were (1) the
coarseness of grouping, (2) the distribution of observations among the
groups, and (3) the distribution of the independent variable within and
among the groups. As might be expected, the most efficient estimates
were found to coincide with variables that generated a large number of
maximally discrete and compact groups.

We also considered ways of applying "structural equation" methods
with nominal grouping characteristics in Chapter 4. A classification
scheme proposed by Wiley was discussed wherein grouping variables are
categorized by their scale (nominal or interval) and by whether the
groups in the study are the entire population (fixed) or only a sample
from the population of interest (random). The nominal grouping variable
was viewed as a surrogate for an underlying grouping variable Z that
has a metric. Though Z is latent and unmeasurable, it can be estimated
by classification procedures describing group differences in Z.
Sampling bias was said to affect grouped estimates when the classes of
the grouping variable are unrepresentative of the population.

Dummy coding procedures used by economists were suggested as a way
to incorporate the nominal grouping characteristic in our models. Dummy
coding is less time-consuming and complex than Wiley's procedure. It
yields functions which can be compared directly with the parameters
generated by ordered grouping characteristics.
In Chapter 5, we described various procedures for analyzing the
effects of grouping in the multivariate case. Of particular interest
was a statistic developed by Feige and Watts (1972) for assessing the
divergence between grouped and ungrouped regression coefficients. Also,
we showed that the results from the "structural equation" approach in
the two-regressor case agreed with the findings in the single-regressor
case. Extension of the results from the "structural equations" approach
to more than two regressors is straightforward. However, the analysis
rapidly becomes complicated with additional regressors because of the
necessity to specify the structural relations among all variables in
the model (including the grouping variable).

Empirical examples of grouping in the single-regressor case were
presented in Chapter 6. In general, the results conformed to our
expectations, and the predictions from the structural equations
approach were reasonably accurate. The use of weighted composites of
estimates from different grouping methods was demonstrated. These
weighted composites were recommended as a possible means of estimating
coefficients when information on certain primary variables is collected
anonymously.

When the within-category effects of the different factors are
combined with our knowledge of the category and scale differences,
several principles evolve for selecting a grouping variable which
minimizes bias and maximizes efficiency. A partial list of these
principles in the single-regressor case includes the following:

   A. To obtain unbiased estimates of the linear regression
      coefficient, choose a Z so that (in order of preference)
         1) Z is related to X but not to Y·X (Category III),
         2) Z is not related to either X or Y (Category IV),
      or 3) Z is related to Y·X but not to X (Category II).
      Category III variables are preferable because they generally
      yield efficient estimators: the between-group variation in the
      regressor is maximized.
   B. When biased estimates are the only alternative, choose Z so
      that
         1) $\beta_{XZ}$ is as large as possible,
         2) $\beta_{YZ\cdot X}$ is as small as possible,
         3) $\beta_{YZ\cdot X}$ is smaller than $\beta_{XZ}$, and
         4) the ratio $\beta_{YZ\cdot X}/\beta_{XZ}$ approaches as
            nearly as possible the ratio

   C. The efficiency of the grouped estimator increases as
         1) m approaches N; or
         2) average n increases when random measurement errors in X
            are possible, but decreases otherwise; or
         3) the correlation ratio approaches unity; or
         4) the pooled within-group variance in the independent
            variable becomes smaller; or
         5) the degree of overlap among the within-group distributions
            of the independent variable decreases.

There are obviously other intangibles that cannot be dealt with by
general principles. There is always the problem of the degree of
investigator control over the grouping process. As stated earlier,
anonymous collection of data on some primary variables seriously
complicates matters, as does adding more regressors. We have tried to
identify only the strategic aspects of the process of determining the
effects of grouping and have left to future investigations the
practical details of application. Proper application of these
principles requires that the investigator thoroughly understand the
theoretical model in question, and no set of guidelines can adequately
ensure that this will occur.

II. Suggestions for Further Investigation

At a number of points, we have noted areas where the present state
of knowledge on the complications due to data aggregation is weak and
further investigation is warranted. Here we indicate several of the
more interesting and pressing questions.

   1. Nominal Grouping Characteristics -- In the introductory
      chapter, we described five research problems in which aspects
      of data aggregation are encountered. The discussion that
      followed focused almost entirely on questions that arise in
      two contexts [(C) economy of analysis and (D) anonymously
      collected data]. Perhaps the most important question from
      the perspective of educational researchers is how to determine
      the effects of grouping on a nominal characteristic such as
      school [problem (E)]. Our treatment of nominal variables in
      Chapter 4 merely provides some suggestions about how this work
      might proceed. Much more research is necessary to determine
      the special complications that arise in predicting the effects
      of grouping on a nominal characteristic.
   2. Missing Data and Measurement Errors -- The suggested
      utilization of grouping in handling problems with missing data
      [problem (A)] and measurement errors [problem (B)] requires
      further elaboration and investigation.
   3. Weighted Composites and Anonymously Collected Data -- The
      description of the use of aggregated data to overcome
      complications with anonymously collected data [problem (D)] and
      the subsequent example which used weighted composites of
      estimates from grouped data to estimate coefficients represent a
      potentially valuable new field for planned application of data
      aggregation. More work is necessary to establish the generality
      of the technique of estimating individual-level relations
      from weighted composites of between-group coefficients.
   4. Multivariate Models -- A more thorough investigation of the
      effects of grouping in models with multiple regressors is
      highly desirable. The comparative utility of the "structural
      equations" approach and the procedures suggested by Feige and
      Watts needs to be investigated. Additionally, hardly anything
      is known about the optimal grouping method when the hypotheses
      of interest posit some form of simultaneity of causation in
      multivariate models.
   5. Aggregation Over Time -- An investigation of whether the
      principles developed here apply when the grouping variable is
      some time interval ("year", "occasion") could be of value. The
      results may provide new insight into the partitioning of
      observation periods in classroom process studies.
   6. Appropriate Model Specification -- We have purposely focused
      on the conditions under which the estimates from grouped data
      provide accurate or misleading information about relations
      among measurements on individuals. It is evident that the
      principles governing aggregation bias are a subset of the
      problems that appear in the econometrics literature under the
      heading of "specification bias". The necessary interrelation
      between specification bias and aggregation bias needs to be
      elaborated and communicated to the educational research
      community. This elaboration would necessarily include warnings
      about the potential hazards of accepting global measures of
      association (e.g., individual-level correlations) as accurately
      reflecting the actual processes in operation. When there exist
      group-to-group differences on the primary variables, it is often
      more appropriate to conduct within-group analyses or to include
      additional variables that account for group differences in the
      model. This latter kind of specification problem suggests
      the interface between the analysis of covariance procedures and
      the analysis of grouping effects.
   7. Multilevel Analysis -- In the literature on school effects,
      investigators have begun to recognize that it may be necessary
      to adjust for the lack of independence among students within
      classrooms. Procedures that combine within-class or
      within-school analyses with analyses at a higher level of
      aggregation deserve more attention.

This list is by no means complete. However, it does accurately
reflect the concerns over data aggregation in educational research and
directions for further inquiry by educational researchers.